diff --git a/projects-appendix/modules/ROOT/attachments/project_template.ipynb b/projects-appendix/modules/ROOT/attachments/project_template.ipynb
deleted file mode 100644
index 550d20aed..000000000
--- a/projects-appendix/modules/ROOT/attachments/project_template.ipynb
+++ /dev/null
@@ -1,190 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "be02a957-7133-4d02-818e-fedeb3cecb05",
- "metadata": {},
- "source": [
- "# Project X -- [First Name] [Last Name]"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a1228853-dd19-4ab2-89e0-0394d7d72de3",
- "metadata": {},
- "source": [
- "**TA Help:** John Smith, Alice Jones\n",
- "\n",
- "- Help with figuring out how to write a function.\n",
- " \n",
- "**Collaboration:** Friend1, Friend2\n",
- " \n",
- "- Helped figuring out how to load the dataset.\n",
- "- Helped debug error with my plot."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6180e742-8e39-4698-98ff-5b00c8cf8ea0",
- "metadata": {},
- "source": [
- "## Question 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "49445606-d363-41b4-b479-e319a9a84c01",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b456e57c-4a12-464b-999a-ef2df5af80c1",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fc601975-35ed-4680-a4e1-0273ee3cc047",
- "metadata": {},
- "source": [
- "## Question 2"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a16336a1-1ef0-41e8-bc7c-49387db27497",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "14dc22d4-ddc3-41cc-a91a-cb0025bc0c80",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8e586edd-ff26-4ce2-8f6b-2424b26f2929",
- "metadata": {},
- "source": [
- "## Question 3"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "bbe0f40d-9655-4653-9ca8-886bdb61cb91",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "47c6229f-35f7-400c-8366-c442baa5cf47",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "da22f29c-d245-4d2b-9fc1-ca14cb6087d9",
- "metadata": {},
- "source": [
- "## Question 4"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8cffc767-d1c8-4d64-b7dc-f0d2ee8a80d1",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d552245-b4d6-474a-9cc9-fa7b8e674d55",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "88c9cdac-3e92-498f-83fa-e089bfc44ac8",
- "metadata": {},
- "source": [
- "## Question 5"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d370d7c9-06db-42b9-b75f-240481a5c491",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9fbf00fb-2418-460f-ae94-2a32b0c28952",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f76442d6-d02e-4f26-b9d6-c3183e1d6929",
- "metadata": {},
- "source": [
- "## Pledge\n",
- "\n",
- "By submitting this work I hereby pledge that this is my own, personal work. I've acknowledged in the designated place at the top of this file all sources that I used to complete said work, including but not limited to: online resources, books, and electronic communications. I've noted all collaboration with fellow students and/or TA's. I did not copy or plagiarize another's work.\n",
- "\n",
- "> As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together – We are Purdue."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "f2022-s2023",
- "language": "python",
- "name": "f2022-s2023"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.5"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/projects-appendix/modules/ROOT/attachments/think_summer_project_template.ipynb b/projects-appendix/modules/ROOT/attachments/think_summer_project_template.ipynb
deleted file mode 100644
index 411122592..000000000
--- a/projects-appendix/modules/ROOT/attachments/think_summer_project_template.ipynb
+++ /dev/null
@@ -1,190 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "be02a957-7133-4d02-818e-fedeb3cecb05",
- "metadata": {},
- "source": [
- "# Project X -- [First Name] [Last Name]"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a1228853-dd19-4ab2-89e0-0394d7d72de3",
- "metadata": {},
- "source": [
- "**TA Help:** John Smith, Alice Jones\n",
- "\n",
- "- Help with figuring out how to write a function.\n",
- " \n",
- "**Collaboration:** Friend1, Friend2\n",
- " \n",
- "- Helped figuring out how to load the dataset.\n",
- "- Helped debug error with my plot."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6180e742-8e39-4698-98ff-5b00c8cf8ea0",
- "metadata": {},
- "source": [
- "## Question 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "49445606-d363-41b4-b479-e319a9a84c01",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b456e57c-4a12-464b-999a-ef2df5af80c1",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fc601975-35ed-4680-a4e1-0273ee3cc047",
- "metadata": {},
- "source": [
- "## Question 2"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a16336a1-1ef0-41e8-bc7c-49387db27497",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "14dc22d4-ddc3-41cc-a91a-cb0025bc0c80",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8e586edd-ff26-4ce2-8f6b-2424b26f2929",
- "metadata": {},
- "source": [
- "## Question 3"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "bbe0f40d-9655-4653-9ca8-886bdb61cb91",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "47c6229f-35f7-400c-8366-c442baa5cf47",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "da22f29c-d245-4d2b-9fc1-ca14cb6087d9",
- "metadata": {},
- "source": [
- "## Question 4"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8cffc767-d1c8-4d64-b7dc-f0d2ee8a80d1",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d552245-b4d6-474a-9cc9-fa7b8e674d55",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "88c9cdac-3e92-498f-83fa-e089bfc44ac8",
- "metadata": {},
- "source": [
- "## Question 5"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d370d7c9-06db-42b9-b75f-240481a5c491",
- "metadata": {},
- "outputs": [],
- "source": [
- "# code here"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9fbf00fb-2418-460f-ae94-2a32b0c28952",
- "metadata": {},
- "source": [
- "Markdown notes and sentences and analysis written here."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f76442d6-d02e-4f26-b9d6-c3183e1d6929",
- "metadata": {},
- "source": [
- "## Pledge\n",
- "\n",
- "By submitting this work I hereby pledge that this is my own, personal work. I've acknowledged in the designated place at the top of this file all sources that I used to complete said work, including but not limited to: online resources, books, and electronic communications. I've noted all collaboration with fellow students and/or TA's. I did not copy or plagiarize another's work.\n",
- "\n",
- "> As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together – We are Purdue."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "think-summer",
- "language": "python",
- "name": "think-summer"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.5"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/projects-appendix/modules/ROOT/examples/10100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/10100-2022-projects.csv
deleted file mode 100644
index ba5dc1353..000000000
--- a/projects-appendix/modules/ROOT/examples/10100-2022-projects.csv
+++ /dev/null
@@ -1,14 +0,0 @@
-Project,Release date,Due date
-xref:fall2022/10100/10100-2022-project01.adoc[Project 1: Getting acquainted with Jupyter Lab],August 22,September 9
-xref:fall2022/10100/10100-2022-project02.adoc[Project 2: Introduction to R: part I],August 25,September 9
-xref:fall2022/10100/10100-2022-project03.adoc[Project 3: Introduction to R: part II],September 8,September 16
-xref:fall2022/10100/10100-2022-project04.adoc[Project 4: Introduction to R: part III],September 15,September 23
-xref:fall2022/10100/10100-2022-project05.adoc[Project 5: Tapply],September 22,September 30
-xref:fall2022/10100/10100-2022-project06.adoc[Project 6: Vectorized operations in R],September 29,October 7
-xref:fall2022/10100/10100-2022-project07.adoc[Project 7: Review: part I],October 6,October 21
-xref:fall2022/10100/10100-2022-project08.adoc[Project 8: Review: part II],October 20,October 28
-xref:fall2022/10100/10100-2022-project09.adoc[Project 9: Base R functions],October 27,November 4
-xref:fall2022/10100/10100-2022-project10.adoc[Project 10: Functions in R: part I],November 3,November 11
-"xref:fall2022/10100/10100-2022-project11.adoc[Project 11: Functions in R: part II]",November 10,November 18
-"xref:fall2022/10100/10100-2022-project12.adoc[Project 12: Lists & Sapply]",November 17,December 2
-xref:fall2022/10100/10100-2022-project13.adoc[Project 13: Review: part III],December 1,December 9
diff --git a/projects-appendix/modules/ROOT/examples/10200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/10200-2023-projects.csv
deleted file mode 100644
index 92e2679c9..000000000
--- a/projects-appendix/modules/ROOT/examples/10200-2023-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2023/10200/10200-2023-project01.adoc[Project 1: Introduction to Python: part I],January 9,January 20
-xref:spring2023/10200/10200-2023-project02.adoc[Project 2: Introduction to Python: part II],January 19,January 27
-xref:spring2023/10200/10200-2023-project03.adoc[Project 3: Introduction to Python: part III],January 26,February 3
-xref:spring2023/10200/10200-2023-project04.adoc[Project 4: Scientific computing & pandas: part I],February 2,February 10
-xref:spring2023/10200/10200-2023-project05.adoc[Project 5: Functions: part I],February 9,February 17
-xref:spring2023/10200/10200-2023-project06.adoc[Project 6: Functions: part II],February 16,February 24
-xref:spring2023/10200/10200-2023-project07.adoc[Project 7: Scientific computing & pandas: part II],February 23,March 3
-xref:spring2023/10200/10200-2023-project08.adoc[Project 8: Scientific computing & pandas: part III],March 2,March 10
-xref:spring2023/10200/10200-2023-project09.adoc[Project 9: Scientific computing & pandas: part IV],March 9,March 24
-xref:spring2023/10200/10200-2023-project10.adoc[Project 10: Importing and using packages],March 23,March 31
-"xref:spring2023/10200/10200-2023-project11.adoc[Project 11: Classes, dunder methods, attributes, methods, etc.: part I]",March 30,April 7
-"xref:spring2023/10200/10200-2023-project12.adoc[Project 12: Classes, dunder methods, attributes, methods, etc.: part II]",April 6,April 14
-xref:spring2023/10200/10200-2023-project13.adoc[Project 13: Data wrangling and matplotlib: part I],April 13,April 21
-xref:spring2023/10200/10200-2023-project14.adoc[Project 14: Data wrangling and matplotlib: part II],April 20,April 28
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/10200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/10200-2024-projects.csv
deleted file mode 100644
index 84e589af2..000000000
--- a/projects-appendix/modules/ROOT/examples/10200-2024-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2024/10200/10200-2024-project01.adoc[Project 1: Getting acquainted with Jupyter Lab],8-Jan,19-Jan
-xref:spring2024/10200/10200-2024-project02.adoc[Project 2: Python tuples/lists/data frames/matplotlib],11-Jan,26-Jan
-xref:spring2024/10200/10200-2024-project03.adoc[Project 3: Looping through files],25-Jan,2-Feb
-xref:spring2024/10200/10200-2024-project04.adoc[Project 4: Looping through data frames],1-Feb,9-Feb
-xref:spring2024/10200/10200-2024-project05.adoc[Project 5: Writing functions for analyzing data],8-Feb,16-Feb
-xref:spring2024/10200/10200-2024-project06.adoc[Project 6: More practice with functions],15-Feb,23-Feb
-xref:spring2024/10200/10200-2024-project07.adoc[Project 7: Even more practice with functions],22-Feb,1-Mar
-xref:spring2024/10200/10200-2024-project08.adoc[Project 8: Another project with functions],29-Feb,8-Mar
-xref:spring2024/10200/10200-2024-project09.adoc[Project 9: Deeper dive into functions and analysis of data frames],7-Mar,22-Mar
-xref:spring2024/10200/10200-2024-project10.adoc[Project 10: Introduction to numpy],21-Mar,29-Mar
-xref:spring2024/10200/10200-2024-project11.adoc[Project 11: Introduction to classes],28-Mar,5-Apr
-xref:spring2024/10200/10200-2024-project12.adoc[Project 12: Deeper dive into classes],4-Apr,12-Apr
-xref:spring2024/10200/10200-2024-project13.adoc[Project 13: Introduction to flask],11-Apr,19-Apr
-xref:spring2024/10200/10200-2024-project14.adoc[Project 14: Feedback about Spring 2024],18-Apr,26-Apr
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/19000-s2022-projects.csv b/projects-appendix/modules/ROOT/examples/19000-s2022-projects.csv
deleted file mode 100644
index 3ab3173a2..000000000
--- a/projects-appendix/modules/ROOT/examples/19000-s2022-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2022/19000/19000-s2022-project01.adoc[Project 1: Introduction to Python: part I],January 6,January 21
-xref:spring2022/19000/19000-s2022-project02.adoc[Project 2: Introduction to Python: part II],January 20,January 28
-xref:spring2022/19000/19000-s2022-project03.adoc[Project 3: Introduction to Python: part III],January 27,February 4
-xref:spring2022/19000/19000-s2022-project04.adoc[Project 4: Scientific computing & pandas: part I],February 3,February 11
-xref:spring2022/19000/19000-s2022-project05.adoc[Project 5: Functions: part I],February 10,February 18
-xref:spring2022/19000/19000-s2022-project06.adoc[Project 6: Functions: part II],February 17,February 25
-xref:spring2022/19000/19000-s2022-project07.adoc[Project 7: Scientific computing & pandas: part II],February 24,March 4
-xref:spring2022/19000/19000-s2022-project08.adoc[Project 8: Scientific computing & pandas: part III],March 3,March 11
-xref:spring2022/19000/19000-s2022-project09.adoc[Project 9: Scientific computing & pandas: part IV],March 17,March 25
-xref:spring2022/19000/19000-s2022-project10.adoc[Project 10: Importing and using packages],March 24,April 1
-"xref:spring2022/19000/19000-s2022-project11.adoc[Project 11: Classes, dunder methods, attributes, methods, etc.: part I]",March 31,April 8
-"xref:spring2022/19000/19000-s2022-project12.adoc[Project 12: Classes, dunder methods, attributes, methods, etc.: part II]",April 7,April 15
-xref:spring2022/19000/19000-s2022-project13.adoc[Project 13: Data wrangling and matplotlib: part I],April 14,April 22
-xref:spring2022/19000/19000-s2022-project14.adoc[Project 14: Data wrangling and matplotlib: part II],April 21,April 29
diff --git a/projects-appendix/modules/ROOT/examples/20100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/20100-2022-projects.csv
deleted file mode 100644
index 1d452abb5..000000000
--- a/projects-appendix/modules/ROOT/examples/20100-2022-projects.csv
+++ /dev/null
@@ -1,14 +0,0 @@
-Project,Release date,Due date
-xref:fall2022/20100/20100-2022-project01.adoc[Project 1: Review: Jupyter Lab],August 22,September 9
-xref:fall2022/20100/20100-2022-project02.adoc[Project 2: Navigating UNIX: part I],August 25,September 9
-xref:fall2022/20100/20100-2022-project03.adoc[Project 3: Navigating UNIX: part II],September 8,September 16
-xref:fall2022/20100/20100-2022-project04.adoc[Project 4: Pattern matching in UNIX & R],September 15,September 23
-xref:fall2022/20100/20100-2022-project05.adoc[Project 5: awk and bash scripts: part I],September 22,September 30
-xref:fall2022/20100/20100-2022-project06.adoc[Project 6: awk & bash scripts: part II],September 29,October 7
-xref:fall2022/20100/20100-2022-project07.adoc[Project 7: awk & bash scripts: part III],October 6,October 21
-xref:fall2022/20100/20100-2022-project08.adoc[Project 8: SQL: part I],October 20,October 28
-xref:fall2022/20100/20100-2022-project09.adoc[Project 9: SQL: part II],October 27,November 4
-xref:fall2022/20100/20100-2022-project10.adoc[Project 10: SQL: part III],November 3,November 11
-xref:fall2022/20100/20100-2022-project11.adoc[Project 11: SQL: part IV],November 10,November 18
-xref:fall2022/20100/20100-2022-project12.adoc[Project 12: SQL: part V],November 17,December 2
-xref:fall2022/20100/20100-2022-project13.adoc[Project 13: SQL: part VI],December 1,December 9
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/20200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/20200-2023-projects.csv
deleted file mode 100644
index 94ca4326d..000000000
--- a/projects-appendix/modules/ROOT/examples/20200-2023-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2023/20200/20200-2023-project01.adoc[Project 1: Introduction to XML],January 9,January 20
-xref:spring2023/20200/20200-2023-project02.adoc[Project 2: Web scraping in Python: part I],January 19,January 27
-xref:spring2023/20200/20200-2023-project03.adoc[Project 3: Web scraping in Python: part II],January 26,February 3
-xref:spring2023/20200/20200-2023-project04.adoc[Project 4: Web scraping in Python: part III],February 2,February 10
-xref:spring2023/20200/20200-2023-project05.adoc[Project 5: Web scraping in Python: part IV],February 9,February 17
-xref:spring2023/20200/20200-2023-project06.adoc[Project 6: Web scraping in Python: part V],February 16,February 24
-xref:spring2023/20200/20200-2023-project07.adoc[Project 7: Plotting in Python: part I],February 23,March 3
-xref:spring2023/20200/20200-2023-project08.adoc[Project 8: Plotting in Python: part II],March 2,March 10
-xref:spring2023/20200/20200-2023-project09.adoc[Project 9: Plotting in Python: part III],March 9,March 24
-xref:spring2023/20200/20200-2023-project10.adoc[Project 10: Plotting with ggplot: part I],March 23,March 31
-xref:spring2023/20200/20200-2023-project11.adoc[Project 11: Plotting with ggplot: part II],March 30,April 7
-xref:spring2023/20200/20200-2023-project12.adoc[Project 12: Tidyverse and data.table: part I],April 6,April 14
-xref:spring2023/20200/20200-2023-project13.adoc[Project 13: Tidyverse and data.table: part II],April 13,April 21
-xref:spring2023/20200/20200-2023-project14.adoc[Project 14: Tidyverse and data.table: part III],April 20,April 28
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/20200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/20200-2024-projects.csv
deleted file mode 100644
index 2ca993ae4..000000000
--- a/projects-appendix/modules/ROOT/examples/20200-2024-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2024/20200/20200-2024-project01.adoc[Project 1: Review: Jupyter Lab],8-Jan,19-Jan
-xref:spring2024/20200/20200-2024-project02.adoc[Project 2: Introduction to web scraping with BeautifulSoup],11-Jan,26-Jan
-xref:spring2024/20200/20200-2024-project03.adoc[Project 3: Introduction to web scraping with XPath],25-Jan,2-Feb
-xref:spring2024/20200/20200-2024-project04.adoc[Project 4: Analyzing more than one hundred thousand XML files at once],1-Feb,9-Feb
-xref:spring2024/20200/20200-2024-project05.adoc[Project 5: Extracting information about No Starch Press books from the O'Reilly website using Selenium],8-Feb,16-Feb
-xref:spring2024/20200/20200-2024-project06.adoc[Project 6: Data Visualization],15-Feb,23-Feb
-xref:spring2024/20200/20200-2024-project07.adoc[Project 7: Learning Dash],22-Feb,1-Mar
-xref:spring2024/20200/20200-2024-project08.adoc[Project 8: Introduction to Spark SQL],29-Feb,8-Mar
-xref:spring2024/20200/20200-2024-project09.adoc[Project 9: More Spark SQL and also streaming Spark SQL],7-Mar,22-Mar
-xref:spring2024/20200/20200-2024-project10.adoc[Project 10: Introduction to Machine Learning],21-Mar,29-Mar
-xref:spring2024/20200/20200-2024-project11.adoc[Project 11: More information about Machine Learning],28-Mar,5-Apr
-xref:spring2024/20200/20200-2024-project12.adoc[Project 12: Introduction to containerization],4-Apr,12-Apr
-xref:spring2024/20200/20200-2024-project13.adoc[Project 13: More information about containerization],11-Apr,19-Apr
-xref:spring2024/20200/20200-2024-project14.adoc[Project 14: Feedback about Spring 2024],18-Apr,26-Apr
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/29000-s2022-projects.csv b/projects-appendix/modules/ROOT/examples/29000-s2022-projects.csv
deleted file mode 100644
index 78706a4d6..000000000
--- a/projects-appendix/modules/ROOT/examples/29000-s2022-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2022/29000/29000-s2022-project01.adoc[Project 1: Introduction to XML],January 6,January 21
-xref:spring2022/29000/29000-s2022-project02.adoc[Project 2: Web scraping in Python: part I],January 20,January 28
-xref:spring2022/29000/29000-s2022-project03.adoc[Project 3: Web scraping in Python: part II],January 27,February 4
-xref:spring2022/29000/29000-s2022-project04.adoc[Project 4: Web scraping in Python: part III],February 3,February 11
-xref:spring2022/29000/29000-s2022-project05.adoc[Project 5: Web scraping in Python: part IV],February 10,February 18
-xref:spring2022/29000/29000-s2022-project06.adoc[Project 6: Plotting in Python: part I],February 17,February 25
-xref:spring2022/29000/29000-s2022-project07.adoc[Project 7: Plotting in Python: part II],February 24,March 4
-xref:spring2022/29000/29000-s2022-project08.adoc[Project 8: Writing Python scripts: part I],March 3,March 11
-xref:spring2022/29000/29000-s2022-project09.adoc[Project 9: Writing Python scripts: part II],March 17,March 25
-xref:spring2022/29000/29000-s2022-project10.adoc[Project 10: Plotting with ggplot: part I],March 24,April 1
-xref:spring2022/29000/29000-s2022-project11.adoc[Project 11: Plotting with ggplot: part II],March 31,April 8
-xref:spring2022/29000/29000-s2022-project12.adoc[Project 12: Tidyverse and data.table: part I],April 7,April 15
-xref:spring2022/29000/29000-s2022-project13.adoc[Project 13: Tidyverse and data.table: part II],April 14,April 22
-xref:spring2022/29000/29000-s2022-project14.adoc[Project 14: Tidyverse and data.table: part III],April 21,April 29
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/30100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/30100-2022-projects.csv
deleted file mode 100644
index a8eadc627..000000000
--- a/projects-appendix/modules/ROOT/examples/30100-2022-projects.csv
+++ /dev/null
@@ -1,14 +0,0 @@
-Project,Release date,Due date
-"xref:fall2022/30100/30100-2022-project01.adoc[Project 1: Review: Jupyter Lab]",August 22,September 9
-"xref:fall2022/30100/30100-2022-project02.adoc[Project 2: Python documentation: part I]",August 25,September 9
-"xref:fall2022/30100/30100-2022-project03.adoc[Project 3: Python documentation: part II]",September 8,September 16
-"xref:fall2022/30100/30100-2022-project04.adoc[Project 4: Review: part I]",September 15,September 23
-xref:fall2022/30100/30100-2022-project05.adoc[Project 5: Testing in Python: part I],September 22,September 30
-"xref:fall2022/30100/30100-2022-project06.adoc[Project 6: Testing in Python: part II]",September 29,October 7
-xref:fall2022/30100/30100-2022-project07.adoc[Project 7: Review: part II],October 6,October 21
-xref:fall2022/30100/30100-2022-project08.adoc[Project 8: Virtual environments & packages: part I],October 20,October 28
-xref:fall2022/30100/30100-2022-project09.adoc[Project 9: Virtual environments & packages: part II],October 27,November 4
-xref:fall2022/30100/30100-2022-project10.adoc[Project 10: Virtual environments & packages: part III & APIs: part I],November 3,November 11
-xref:fall2022/30100/30100-2022-project11.adoc[Project 11: APIs: part II],November 10,November 18
-xref:fall2022/30100/30100-2022-project12.adoc[Project 12: APIs: part III],November 17,December 2
-xref:fall2022/30100/30100-2022-project13.adoc[Project 13: APIs: part IV],December 1,December 9
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/30200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/30200-2023-projects.csv
deleted file mode 100644
index 2c8d67763..000000000
--- a/projects-appendix/modules/ROOT/examples/30200-2023-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-"xref:spring2023/30200/30200-2023-project01.adoc[Project 1: Review: UNIX, terminology, etc.]",January 9,January 20
-"xref:spring2023/30200/30200-2023-project02.adoc[Project 2: Concurrency, parallelism, cores, threads: part I]",January 19,January 27
-"xref:spring2023/30200/30200-2023-project03.adoc[Project 3: Concurrency, parallelism, cores, threads: part II]",January 26, February 3
-"xref:spring2023/30200/30200-2023-project04.adoc[Project 4: Concurrency, parallelism, cores, threads: part III]",February 2,February 10
-xref:spring2023/30200/30200-2023-project05.adoc[Project 5: High performance computing on Brown with SLURM: part I],February 9,February 17
-xref:spring2023/30200/30200-2023-project06.adoc[Project 6: High performance computing on Brown with SLURM: part II],February 16,February 24
-xref:spring2023/30200/30200-2023-project07.adoc[Project 7: High performance computing on Brown with SLURM: part III],February 23,March 3
-xref:spring2023/30200/30200-2023-project08.adoc[Project 8: PyTorch & JAX: part I],March 2,March 10
-xref:spring2023/30200/30200-2023-project09.adoc[Project 9: PyTorch & JAX: part II],March 9,March 24
-xref:spring2023/30200/30200-2023-project10.adoc[Project 10: High performance computing on Brown with SLURM: part IV -- GPUs],March 23,March 31
-xref:spring2023/30200/30200-2023-project11.adoc[Project 11: PyTorch & JAX: part III],March 30,April 7
-xref:spring2023/30200/30200-2023-project12.adoc[Project 12: PyTorch & JAX: part IV],April 6,April 14
-xref:spring2023/30200/30200-2023-project13.adoc[Project 13: ETL fun & review: part I],April 13,April 21
-xref:spring2023/30200/30200-2023-project14.adoc[Project 14: ETL fun & review: part II],April 20,April 28
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/30200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/30200-2024-projects.csv
deleted file mode 100644
index 37a628d62..000000000
--- a/projects-appendix/modules/ROOT/examples/30200-2024-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2024/30200/30200-2024-project01.adoc[Project 1],8-Jan,19-Jan
-xref:spring2024/30200/30200-2024-project02.adoc[Project 2],11-Jan,26-Jan
-xref:spring2024/30200/30200-2024-project03.adoc[Project 3],25-Jan,2-Feb
-xref:spring2024/30200/30200-2024-project04.adoc[Project 4],1-Feb,9-Feb
-xref:spring2024/30200/30200-2024-project05.adoc[Project 5],8-Feb,16-Feb
-xref:spring2024/30200/30200-2024-project06.adoc[Project 6],15-Feb,23-Feb
-xref:spring2024/30200/30200-2024-project07.adoc[Project 7],22-Feb,1-Mar
-xref:spring2024/30200/30200-2024-project08.adoc[Project 8],29-Feb,8-Mar
-xref:spring2024/30200/30200-2024-project09.adoc[Project 9],7-Mar,22-Mar
-xref:spring2024/30200/30200-2024-project10.adoc[Project 10],21-Mar,29-Mar
-xref:spring2024/30200/30200-2024-project11.adoc[Project 11],28-Mar,5-Apr
-xref:spring2024/30200/30200-2024-project12.adoc[Project 12],4-Apr,12-Apr
-xref:spring2024/30200/30200-2024-project13.adoc[Project 13],11-Apr,19-Apr
-xref:spring2024/30200/30200-2024-project14.adoc[Project 14],18-Apr,26-Apr
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/39000-s2022-projects.csv b/projects-appendix/modules/ROOT/examples/39000-s2022-projects.csv
deleted file mode 100644
index 8ae22472f..000000000
--- a/projects-appendix/modules/ROOT/examples/39000-s2022-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-"xref:spring2022/39000/39000-s2022-project01.adoc[Project 1: Review: UNIX, terminology, etc.]",January 6,January 21
-"xref:spring2022/39000/39000-s2022-project02.adoc[Project 2: Concurrency, parallelism, cores, threads: part I]",January 20,January 28
-"xref:spring2022/39000/39000-s2022-project03.adoc[Project 3: Concurrency, parallelism, cores, threads: part II]",January 27,February 4
-"xref:spring2022/39000/39000-s2022-project04.adoc[Project 4: Concurrency, parallelism, cores, threads: part III]",February 3,February 11
-xref:spring2022/39000/39000-s2022-project05.adoc[Project 5: High performance computing on Brown with SLURM: part I],February 10,February 18
-xref:spring2022/39000/39000-s2022-project06.adoc[Project 6: High performance computing on Brown with SLURM: part II],February 17,February 25
-xref:spring2022/39000/39000-s2022-project07.adoc[Project 7: High performance computing on Brown with SLURM: part III],February 24,March 4
-xref:spring2022/39000/39000-s2022-project08.adoc[Project 8: PyTorch & JAX: part I],March 3,March 11
-xref:spring2022/39000/39000-s2022-project09.adoc[Project 9: PyTorch & JAX: part II],March 17,March 25
-xref:spring2022/39000/39000-s2022-project10.adoc[Project 10: High performance computing on Brown with SLURM: part IV -- GPUs],March 24,April 1
-xref:spring2022/39000/39000-s2022-project11.adoc[Project 11: PyTorch & JAX: part III],March 31,April 8
-xref:spring2022/39000/39000-s2022-project12.adoc[Project 12: PyTorch & JAX: part IV],April 7,April 15
-xref:spring2022/39000/39000-s2022-project13.adoc[Project 13: ETL fun & review: part I],April 14,April 22
-xref:spring2022/39000/39000-s2022-project14.adoc[Project 14: ETL fun & review: part II],April 21,April 29
diff --git a/projects-appendix/modules/ROOT/examples/40100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/40100-2022-projects.csv
deleted file mode 100644
index aaf44f432..000000000
--- a/projects-appendix/modules/ROOT/examples/40100-2022-projects.csv
+++ /dev/null
@@ -1,14 +0,0 @@
-Project,Release date,Due date
-"xref:fall2022/40100/40100-2022-project01.adoc[Project 1: Review: Jupyter Lab]",August 22,September 9
-"xref:fall2022/40100/40100-2022-project02.adoc[Project 2: SQLite deepish dive: part I]",August 25,September 9
-"xref:fall2022/40100/40100-2022-project03.adoc[Project 3: SQLite deepish dive: part II]",September 8,September 16
-"xref:fall2022/40100/40100-2022-project04.adoc[Project 4: SQLite deepish dive: part III]",September 15,September 23
-"xref:fall2022/40100/40100-2022-project05.adoc[Project 5: SQLite deepish dive: part IV]",September 22,September 30
-"xref:fall2022/40100/40100-2022-project06.adoc[Project 6: Working with images: part I]",September 29,October 7
-xref:fall2022/40100/40100-2022-project07.adoc[Project 7: Working with images: part II],October 6,October 21
-xref:fall2022/40100/40100-2022-project08.adoc[Project 8: Working with images: part III],October 20,October 28
-xref:fall2022/40100/40100-2022-project09.adoc[Project 9: Working with images: part IV],October 27,November 4
-xref:fall2022/40100/40100-2022-project10.adoc[Project 10: Web scraping and mixed topics: part I],November 3,November 11
-xref:fall2022/40100/40100-2022-project11.adoc[Project 11: Web scraping and mixed topics: part II],November 10,November 18
-xref:fall2022/40100/40100-2022-project12.adoc[Project 12: Web scraping and mixed topics: part III],November 17,December 2
-xref:fall2022/40100/40100-2022-project13.adoc[Project 13: Web scraping and mixed topics: part IV],December 1,December 9
diff --git a/projects-appendix/modules/ROOT/examples/40200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/40200-2023-projects.csv
deleted file mode 100644
index 203dfb6ab..000000000
--- a/projects-appendix/modules/ROOT/examples/40200-2023-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-"xref:spring2023/40200/40200-2023-project01.adoc[Project 1: Review JAX]",January 9,January 20
-"xref:spring2023/40200/40200-2023-project02.adoc[Project 2: Building a dashboard: part I]",January 19,January 27
-"xref:spring2023/40200/40200-2023-project03.adoc[Project 3: Building a dashboard: part II]",January 26, February 3
-"xref:spring2023/40200/40200-2023-project04.adoc[Project 4: Building a dashboard: part III]",February 2,February 10
-xref:spring2023/40200/40200-2023-project05.adoc[Project 5: Building a dashboard: part IV],February 9,February 17
-xref:spring2023/40200/40200-2023-project06.adoc[Project 6: Building a dashboard: part V],February 16,February 24
-xref:spring2023/40200/40200-2023-project07.adoc[Project 7: Building a dashboard: part VI],February 23,March 3
-xref:spring2023/40200/40200-2023-project08.adoc[Project 8: Building a dashboard: part VII],March 2,March 10
-xref:spring2023/40200/40200-2023-project09.adoc[Project 9: Building a dashboard: part VIII],March 9,March 24
-xref:spring2023/40200/40200-2023-project10.adoc[Project 10: Building a dashboard: part IX],March 23,March 31
-xref:spring2023/40200/40200-2023-project11.adoc[Project 11: Containers: part I],March 30,April 7
-xref:spring2023/40200/40200-2023-project12.adoc[Project 12: Containers: part II],April 6,April 14
-xref:spring2023/40200/40200-2023-project13.adoc[Project 13: Containers: part III],April 13,April 21
-xref:spring2023/40200/40200-2023-project14.adoc[Project 14: Containers: part IV],April 20,April 28
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/examples/40200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/40200-2024-projects.csv
deleted file mode 100644
index 3e0ab71c8..000000000
--- a/projects-appendix/modules/ROOT/examples/40200-2024-projects.csv
+++ /dev/null
@@ -1,15 +0,0 @@
-Project,Release date,Due date
-xref:spring2024/40200/40200-2024-project01.adoc[Project 1],8-Jan,19-Jan
-xref:spring2024/40200/40200-2024-project02.adoc[Project 2],11-Jan,26-Jan
-xref:spring2024/40200/40200-2024-project03.adoc[Project 3],25-Jan,2-Feb
-xref:spring2024/40200/40200-2024-project04.adoc[Project 4],1-Feb,9-Feb
-xref:spring2024/40200/40200-2024-project05.adoc[Project 5],8-Feb,16-Feb
-xref:spring2024/40200/40200-2024-project06.adoc[Project 6],15-Feb,23-Feb
-xref:spring2024/40200/40200-2024-project07.adoc[Project 7],22-Feb,1-Mar
-xref:spring2024/40200/40200-2024-project08.adoc[Project 8],29-Feb,8-Mar
-xref:spring2024/40200/40200-2024-project09.adoc[Project 9],7-Mar,22-Mar
-xref:spring2024/40200/40200-2024-project10.adoc[Project 10],21-Mar,29-Mar
-xref:spring2024/40200/40200-2024-project11.adoc[Project 11],28-Mar,5-Apr
-xref:spring2024/40200/40200-2024-project12.adoc[Project 12],4-Apr,12-Apr
-xref:spring2024/40200/40200-2024-project13.adoc[Project 13],11-Apr,19-Apr
-xref:spring2024/40200/40200-2024-project14.adoc[Project 14],18-Apr,26-Apr
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/images/f24-101-OH.png b/projects-appendix/modules/ROOT/images/f24-101-OH.png
deleted file mode 100644
index baf688047..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-101-OH.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-1.png b/projects-appendix/modules/ROOT/images/f24-101-p1-1.png
deleted file mode 100644
index 5725b1061..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-1.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-2.png b/projects-appendix/modules/ROOT/images/f24-101-p1-2.png
deleted file mode 100644
index 5408aefc7..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-2.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-3.png b/projects-appendix/modules/ROOT/images/f24-101-p1-3.png
deleted file mode 100644
index 25af69c7c..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-3.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-4.png b/projects-appendix/modules/ROOT/images/f24-101-p1-4.png
deleted file mode 100644
index 953d09ff8..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-4.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-101-p10-1.png b/projects-appendix/modules/ROOT/images/f24-101-p10-1.png
deleted file mode 100644
index 5375cc4c5..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-101-p10-1.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-201-OH.png b/projects-appendix/modules/ROOT/images/f24-201-OH.png
deleted file mode 100644
index c321b39a1..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-201-OH.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-201-p1-1.png b/projects-appendix/modules/ROOT/images/f24-201-p1-1.png
deleted file mode 100644
index 12dd5bdb8..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-201-p1-1.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-OH.png b/projects-appendix/modules/ROOT/images/f24-301-OH.png
deleted file mode 100644
index 850d087e7..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-OH.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-p11-1.PNG b/projects-appendix/modules/ROOT/images/f24-301-p11-1.PNG
deleted file mode 100644
index 543e10762..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-p11-1.PNG and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-p5-1.png b/projects-appendix/modules/ROOT/images/f24-301-p5-1.png
deleted file mode 100644
index 401c70ccb..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-p5-1.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-p7-1-2.PNG b/projects-appendix/modules/ROOT/images/f24-301-p7-1-2.PNG
deleted file mode 100644
index bbd1a2b9f..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-p7-1-2.PNG and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-p7-1.PNG b/projects-appendix/modules/ROOT/images/f24-301-p7-1.PNG
deleted file mode 100644
index 13cae4be7..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-p7-1.PNG and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-p8-1.png b/projects-appendix/modules/ROOT/images/f24-301-p8-1.png
deleted file mode 100644
index 075de94f7..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-p8-1.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-301-p8-2.png b/projects-appendix/modules/ROOT/images/f24-301-p8-2.png
deleted file mode 100644
index 36926d4f2..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-301-p8-2.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/f24-401-OH.png b/projects-appendix/modules/ROOT/images/f24-401-OH.png
deleted file mode 100644
index 4e0bbec28..000000000
Binary files a/projects-appendix/modules/ROOT/images/f24-401-OH.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure01.webp b/projects-appendix/modules/ROOT/images/figure01.webp
deleted file mode 100644
index f07a29780..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure01.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure02.webp b/projects-appendix/modules/ROOT/images/figure02.webp
deleted file mode 100644
index 2460ca5ec..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure02.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure03.webp b/projects-appendix/modules/ROOT/images/figure03.webp
deleted file mode 100644
index 064c14c82..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure03.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure04.webp b/projects-appendix/modules/ROOT/images/figure04.webp
deleted file mode 100644
index e836fd479..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure04.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure05.webp b/projects-appendix/modules/ROOT/images/figure05.webp
deleted file mode 100644
index a6298c950..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure05.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure06.webp b/projects-appendix/modules/ROOT/images/figure06.webp
deleted file mode 100644
index 4c543c1ed..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure06.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure07.webp b/projects-appendix/modules/ROOT/images/figure07.webp
deleted file mode 100644
index 206ad2fb9..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure07.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure08.webp b/projects-appendix/modules/ROOT/images/figure08.webp
deleted file mode 100644
index df664269e..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure08.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure09.webp b/projects-appendix/modules/ROOT/images/figure09.webp
deleted file mode 100644
index 3928998ac..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure09.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure10.webp b/projects-appendix/modules/ROOT/images/figure10.webp
deleted file mode 100644
index 1e9910f81..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure10.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure11.webp b/projects-appendix/modules/ROOT/images/figure11.webp
deleted file mode 100644
index 9ea314a0e..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure11.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure12.webp b/projects-appendix/modules/ROOT/images/figure12.webp
deleted file mode 100644
index 905bc1de7..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure12.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure13.webp b/projects-appendix/modules/ROOT/images/figure13.webp
deleted file mode 100644
index c9690ef1d..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure13.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure14.webp b/projects-appendix/modules/ROOT/images/figure14.webp
deleted file mode 100644
index 7773bc4ba..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure14.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure15.webp b/projects-appendix/modules/ROOT/images/figure15.webp
deleted file mode 100644
index 7a1fc82cb..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure15.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure16.webp b/projects-appendix/modules/ROOT/images/figure16.webp
deleted file mode 100644
index 7eef43f50..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure16.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure17.webp b/projects-appendix/modules/ROOT/images/figure17.webp
deleted file mode 100644
index 0a899198f..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure17.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure18.webp b/projects-appendix/modules/ROOT/images/figure18.webp
deleted file mode 100644
index c0f15eb3e..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure18.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure19.webp b/projects-appendix/modules/ROOT/images/figure19.webp
deleted file mode 100644
index 4e8335939..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure19.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure20.webp b/projects-appendix/modules/ROOT/images/figure20.webp
deleted file mode 100644
index 5625a90a2..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure20.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure21.webp b/projects-appendix/modules/ROOT/images/figure21.webp
deleted file mode 100644
index 08b955b56..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure21.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure22.webp b/projects-appendix/modules/ROOT/images/figure22.webp
deleted file mode 100644
index ec1850e8e..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure22.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure23.webp b/projects-appendix/modules/ROOT/images/figure23.webp
deleted file mode 100644
index 516ce478a..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure23.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure24.webp b/projects-appendix/modules/ROOT/images/figure24.webp
deleted file mode 100644
index 69b38477d..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure24.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure25.webp b/projects-appendix/modules/ROOT/images/figure25.webp
deleted file mode 100644
index 3b0daa1b4..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure25.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure26.webp b/projects-appendix/modules/ROOT/images/figure26.webp
deleted file mode 100644
index a8c6c507f..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure26.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure27.webp b/projects-appendix/modules/ROOT/images/figure27.webp
deleted file mode 100644
index fe0db74b3..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure27.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure28.webp b/projects-appendix/modules/ROOT/images/figure28.webp
deleted file mode 100644
index 79de2ddf5..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure28.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure29.webp b/projects-appendix/modules/ROOT/images/figure29.webp
deleted file mode 100644
index cf915d268..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure29.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure30.webp b/projects-appendix/modules/ROOT/images/figure30.webp
deleted file mode 100644
index 120209141..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure30.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure31.webp b/projects-appendix/modules/ROOT/images/figure31.webp
deleted file mode 100644
index 923057bdb..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure31.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure32.webp b/projects-appendix/modules/ROOT/images/figure32.webp
deleted file mode 100644
index 4d482bd62..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure32.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/figure33.webp b/projects-appendix/modules/ROOT/images/figure33.webp
deleted file mode 100644
index 3a67633f3..000000000
Binary files a/projects-appendix/modules/ROOT/images/figure33.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/stat19000project2figure1.png b/projects-appendix/modules/ROOT/images/stat19000project2figure1.png
deleted file mode 100644
index 44821c03a..000000000
Binary files a/projects-appendix/modules/ROOT/images/stat19000project2figure1.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/stat19000project2figure2.png b/projects-appendix/modules/ROOT/images/stat19000project2figure2.png
deleted file mode 100644
index a98bb800c..000000000
Binary files a/projects-appendix/modules/ROOT/images/stat19000project2figure2.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/stat19000project2figure3.png b/projects-appendix/modules/ROOT/images/stat19000project2figure3.png
deleted file mode 100644
index 7d1cf3064..000000000
Binary files a/projects-appendix/modules/ROOT/images/stat19000project2figure3.png and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure01.webp b/projects-appendix/modules/ROOT/images/think-summer-figure01.webp
deleted file mode 100644
index eed3513bf..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure01.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure02.webp b/projects-appendix/modules/ROOT/images/think-summer-figure02.webp
deleted file mode 100644
index bab93e4ce..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure02.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure03.webp b/projects-appendix/modules/ROOT/images/think-summer-figure03.webp
deleted file mode 100644
index 04205a7dd..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure03.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure04.webp b/projects-appendix/modules/ROOT/images/think-summer-figure04.webp
deleted file mode 100644
index e38ea94a1..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure04.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure05.webp b/projects-appendix/modules/ROOT/images/think-summer-figure05.webp
deleted file mode 100644
index 0e3c82cc7..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure05.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure06.webp b/projects-appendix/modules/ROOT/images/think-summer-figure06.webp
deleted file mode 100644
index d4f90f050..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure06.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure07.webp b/projects-appendix/modules/ROOT/images/think-summer-figure07.webp
deleted file mode 100644
index 54c103603..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure07.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure08.webp b/projects-appendix/modules/ROOT/images/think-summer-figure08.webp
deleted file mode 100644
index 60a41529b..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure08.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure09.webp b/projects-appendix/modules/ROOT/images/think-summer-figure09.webp
deleted file mode 100644
index 99ccc491e..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure09.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure10.webp b/projects-appendix/modules/ROOT/images/think-summer-figure10.webp
deleted file mode 100644
index 02ab97a54..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure10.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure11.webp b/projects-appendix/modules/ROOT/images/think-summer-figure11.webp
deleted file mode 100644
index 72a17da10..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure11.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure12.webp b/projects-appendix/modules/ROOT/images/think-summer-figure12.webp
deleted file mode 100644
index 283622b96..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure12.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure13.webp b/projects-appendix/modules/ROOT/images/think-summer-figure13.webp
deleted file mode 100644
index 085e2218c..000000000
Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure13.webp and /dev/null differ
diff --git a/projects-appendix/modules/ROOT/nav.adoc b/projects-appendix/modules/ROOT/nav.adoc
deleted file mode 100644
index aeadd332a..000000000
--- a/projects-appendix/modules/ROOT/nav.adoc
+++ /dev/null
@@ -1,488 +0,0 @@
-* Fall 2024
-** xref:fall2024/logistics/office_hours.adoc[Course Office Hours]
-** xref:fall2024/logistics/syllabus.adoc[Course Syllabus]
-** https://datamine.purdue.edu/events/[Outside Events]
-** https://www.piazza.com[Piazza]
-** https://ondemand.anvil.rcac.purdue.edu[Anvil]
-** https://www.gradescope.com[Gradescope]
-** xref:fall2024/10100/10100-2024-projects.adoc[TDM 10100]
-*** xref:fall2024/10100/10100-2024-project1.adoc[Project 1]
-*** xref:fall2024/10100/10100-2024-project2.adoc[Project 2]
-*** xref:fall2024/10100/10100-2024-project3.adoc[Project 3]
-*** xref:fall2024/10100/10100-2024-project4.adoc[Project 4]
-*** xref:fall2024/10100/10100-2024-project5.adoc[Project 5]
-*** xref:fall2024/10100/10100-2024-project6.adoc[Project 6]
-*** xref:fall2024/10100/10100-2024-project7.adoc[Project 7]
-*** xref:fall2024/10100/10100-2024-project8.adoc[Project 8]
-*** xref:fall2024/10100/10100-2024-project9.adoc[Project 9]
-*** xref:fall2024/10100/10100-2024-project10.adoc[Project 10]
-*** xref:fall2024/10100/10100-2024-project11.adoc[Project 11]
-*** xref:fall2024/10100/10100-2024-project12.adoc[Project 12]
-*** xref:fall2024/10100/10100-2024-project13.adoc[Project 13]
-*** xref:fall2024/10100/10100-2024-project14.adoc[Project 14]
-** xref:fall2024/20100/20100-2024-projects.adoc[TDM 20100]
-*** xref:fall2024/20100/20100-2024-project1.adoc[Project 1]
-*** xref:fall2024/20100/20100-2024-project2.adoc[Project 2]
-*** xref:fall2024/20100/20100-2024-project3.adoc[Project 3]
-*** xref:fall2024/20100/20100-2024-project4.adoc[Project 4]
-*** xref:fall2024/20100/20100-2024-project5.adoc[Project 5]
-*** xref:fall2024/20100/20100-2024-project6.adoc[Project 6]
-*** xref:fall2024/20100/20100-2024-project7.adoc[Project 7]
-*** xref:fall2024/20100/20100-2024-project8.adoc[Project 8]
-*** xref:fall2024/20100/20100-2024-project9.adoc[Project 9]
-*** xref:fall2024/20100/20100-2024-project10.adoc[Project 10]
-*** xref:fall2024/20100/20100-2024-project11.adoc[Project 11]
-*** xref:fall2024/20100/20100-2024-project12.adoc[Project 12]
-*** xref:fall2024/20100/20100-2024-project13.adoc[Project 13]
-*** xref:fall2024/20100/20100-2024-project14.adoc[Project 14]
-** xref:fall2024/30100/30100-2024-projects.adoc[TDM 30100]
-*** xref:fall2024/30100/30100-2024-project1.adoc[Project 1]
-*** xref:fall2024/30100/30100-2024-project2.adoc[Project 2]
-*** xref:fall2024/30100/30100-2024-project3.adoc[Project 3]
-*** xref:fall2024/30100/30100-2024-project4.adoc[Project 4]
-*** xref:fall2024/30100/30100-2024-project5.adoc[Project 5]
-*** xref:fall2024/30100/30100-2024-project6.adoc[Project 6]
-*** xref:fall2024/30100/30100-2024-project7.adoc[Project 7]
-*** xref:fall2024/30100/30100-2024-project8.adoc[Project 8]
-*** xref:fall2024/30100/30100-2024-project9.adoc[Project 9]
-*** xref:fall2024/30100/30100-2024-project10.adoc[Project 10]
-*** xref:fall2024/30100/30100-2024-project11.adoc[Project 11]
-*** xref:fall2024/30100/30100-2024-project12.adoc[Project 12]
-*** xref:fall2024/30100/30100-2024-project13.adoc[Project 13]
-*** xref:fall2024/30100/30100-2024-project14.adoc[Project 14]
-** xref:fall2024/40100/40100-2024-projects.adoc[TDM 40100]
-*** xref:fall2024/40100/40100-2024-project1.adoc[Project 1]
-*** xref:fall2024/40100/40100-2024-project2.adoc[Project 2]
-*** xref:fall2024/40100/40100-2024-project3.adoc[Project 3]
-*** xref:fall2024/40100/40100-2024-project4.adoc[Project 4]
-*** xref:fall2024/40100/40100-2024-project5.adoc[Project 5]
-*** xref:fall2024/40100/40100-2024-project6.adoc[Project 6]
-*** xref:fall2024/40100/40100-2024-project7.adoc[Project 7]
-*** xref:fall2024/40100/40100-2024-project8.adoc[Project 8]
-*** xref:fall2024/40100/40100-2024-project9.adoc[Project 9]
-*** xref:fall2024/40100/40100-2024-project10.adoc[Project 10]
-*** xref:fall2024/40100/40100-2024-project11.adoc[Project 11]
-*** xref:fall2024/40100/40100-2024-project12.adoc[Project 12]
-*** xref:fall2024/40100/40100-2024-project13.adoc[Project 13]
-*** xref:fall2024/40100/40100-2024-project14.adoc[Project 14]
-
-* Project Archive
-** Fall 2020
-*** STAT 19000
-**** xref:fall2020/19000/19000-f2020-project01.adoc[Project 1]
-**** xref:fall2020/19000/19000-f2020-project02.adoc[Project 2]
-**** xref:fall2020/19000/19000-f2020-project03.adoc[Project 3]
-**** xref:fall2020/19000/19000-f2020-project04.adoc[Project 4]
-**** xref:fall2020/19000/19000-f2020-project05.adoc[Project 5]
-**** xref:fall2020/19000/19000-f2020-project06.adoc[Project 6]
-**** xref:fall2020/19000/19000-f2020-project07.adoc[Project 7]
-**** xref:fall2020/19000/19000-f2020-project08.adoc[Project 8]
-**** xref:fall2020/19000/19000-f2020-project09.adoc[Project 9]
-**** xref:fall2020/19000/19000-f2020-project10.adoc[Project 10]
-**** xref:fall2020/19000/19000-f2020-project11.adoc[Project 11]
-**** xref:fall2020/19000/19000-f2020-project12.adoc[Project 12]
-**** xref:fall2020/19000/19000-f2020-project13.adoc[Project 13]
-**** xref:fall2020/19000/19000-f2020-project14.adoc[Project 14]
-**** xref:fall2020/19000/19000-f2020-project15.adoc[Project 15]
-*** STAT 29000
-**** xref:fall2020/29000/29000-f2020-project01.adoc[Project 1]
-**** xref:fall2020/29000/29000-f2020-project02.adoc[Project 2]
-**** xref:fall2020/29000/29000-f2020-project03.adoc[Project 3]
-**** xref:fall2020/29000/29000-f2020-project04.adoc[Project 4]
-**** xref:fall2020/29000/29000-f2020-project05.adoc[Project 5]
-**** xref:fall2020/29000/29000-f2020-project06.adoc[Project 6]
-**** xref:fall2020/29000/29000-f2020-project07.adoc[Project 7]
-**** xref:fall2020/29000/29000-f2020-project08.adoc[Project 8]
-**** xref:fall2020/29000/29000-f2020-project09.adoc[Project 9]
-**** xref:fall2020/29000/29000-f2020-project10.adoc[Project 10]
-**** xref:fall2020/29000/29000-f2020-project11.adoc[Project 11]
-**** xref:fall2020/29000/29000-f2020-project12.adoc[Project 12]
-**** xref:fall2020/29000/29000-f2020-project13.adoc[Project 13]
-**** xref:fall2020/29000/29000-f2020-project14.adoc[Project 14]
-**** xref:fall2020/29000/29000-f2020-project15.adoc[Project 15]
-*** STAT 39000
-**** xref:fall2020/39000/39000-f2020-project01.adoc[Project 1]
-**** xref:fall2020/39000/39000-f2020-project02.adoc[Project 2]
-**** xref:fall2020/39000/39000-f2020-project03.adoc[Project 3]
-**** xref:fall2020/39000/39000-f2020-project04.adoc[Project 4]
-**** xref:fall2020/39000/39000-f2020-project05.adoc[Project 5]
-**** xref:fall2020/39000/39000-f2020-project06.adoc[Project 6]
-**** xref:fall2020/39000/39000-f2020-project07.adoc[Project 7]
-**** xref:fall2020/39000/39000-f2020-project08.adoc[Project 8]
-**** xref:fall2020/39000/39000-f2020-project09.adoc[Project 9]
-**** xref:fall2020/39000/39000-f2020-project10.adoc[Project 10]
-**** xref:fall2020/39000/39000-f2020-project11.adoc[Project 11]
-**** xref:fall2020/39000/39000-f2020-project12.adoc[Project 12]
-**** xref:fall2020/39000/39000-f2020-project13.adoc[Project 13]
-**** xref:fall2020/39000/39000-f2020-project14.adoc[Project 14]
-**** xref:fall2020/39000/39000-f2020-project15.adoc[Project 15]
-** Spring 2021
-*** STAT 19000
-**** xref:spring2021/19000/19000-s2021-project01.adoc[Project 1]
-**** xref:spring2021/19000/19000-s2021-project02.adoc[Project 2]
-**** xref:spring2021/19000/19000-s2021-project03.adoc[Project 3]
-**** xref:spring2021/19000/19000-s2021-project04.adoc[Project 4]
-**** xref:spring2021/19000/19000-s2021-project05.adoc[Project 5]
-**** xref:spring2021/19000/19000-s2021-project06.adoc[Project 6]
-**** xref:spring2021/19000/19000-s2021-project07.adoc[Project 7]
-**** xref:spring2021/19000/19000-s2021-project08.adoc[Project 8]
-**** xref:spring2021/19000/19000-s2021-project09.adoc[Project 9]
-**** xref:spring2021/19000/19000-s2021-project10.adoc[Project 10]
-**** xref:spring2021/19000/19000-s2021-project11.adoc[Project 11]
-**** xref:spring2021/19000/19000-s2021-project12.adoc[Project 12]
-**** xref:spring2021/19000/19000-s2021-project13.adoc[Project 13]
-**** xref:spring2021/19000/19000-s2021-project14.adoc[Project 14]
-**** xref:spring2021/19000/19000-s2021-project15.adoc[Project 15]
-*** STAT 29000
-**** xref:spring2021/29000/29000-s2021-project01.adoc[Project 1]
-**** xref:spring2021/29000/29000-s2021-project02.adoc[Project 2]
-**** xref:spring2021/29000/29000-s2021-project03.adoc[Project 3]
-**** xref:spring2021/29000/29000-s2021-project04.adoc[Project 4]
-**** xref:spring2021/29000/29000-s2021-project05.adoc[Project 5]
-**** xref:spring2021/29000/29000-s2021-project06.adoc[Project 6]
-**** xref:spring2021/29000/29000-s2021-project07.adoc[Project 7]
-**** xref:spring2021/29000/29000-s2021-project08.adoc[Project 8]
-**** xref:spring2021/29000/29000-s2021-project09.adoc[Project 9]
-**** xref:spring2021/29000/29000-s2021-project10.adoc[Project 10]
-**** xref:spring2021/29000/29000-s2021-project11.adoc[Project 11]
-**** xref:spring2021/29000/29000-s2021-project12.adoc[Project 12]
-**** xref:spring2021/29000/29000-s2021-project13.adoc[Project 13]
-**** xref:spring2021/29000/29000-s2021-project14.adoc[Project 14]
-**** xref:spring2021/29000/29000-s2021-project15.adoc[Project 15]
-*** STAT 39000
-**** xref:spring2021/39000/39000-s2021-project01.adoc[Project 1]
-**** xref:spring2021/39000/39000-s2021-project02.adoc[Project 2]
-**** xref:spring2021/39000/39000-s2021-project03.adoc[Project 3]
-**** xref:spring2021/39000/39000-s2021-project04.adoc[Project 4]
-**** xref:spring2021/39000/39000-s2021-project05.adoc[Project 5]
-**** xref:spring2021/39000/39000-s2021-project06.adoc[Project 6]
-**** xref:spring2021/39000/39000-s2021-project07.adoc[Project 7]
-**** xref:spring2021/39000/39000-s2021-project08.adoc[Project 8]
-**** xref:spring2021/39000/39000-s2021-project09.adoc[Project 9]
-**** xref:spring2021/39000/39000-s2021-project10.adoc[Project 10]
-**** xref:spring2021/39000/39000-s2021-project11.adoc[Project 11]
-**** xref:spring2021/39000/39000-s2021-project12.adoc[Project 12]
-**** xref:spring2021/39000/39000-s2021-project13.adoc[Project 13]
-**** xref:spring2021/39000/39000-s2021-project14.adoc[Project 14]
-**** xref:spring2021/39000/39000-s2021-project15.adoc[Project 15]
-** Fall 2021
-*** xref:fall2021/19000/19000-f2021-projects.adoc[STAT 19000]
-**** xref:fall2021/logistics/19000-f2021-officehours.adoc[Office Hours]
-**** xref:fall2021/19000/19000-f2021-project01.adoc[Project 1]
-**** xref:fall2021/19000/19000-f2021-project02.adoc[Project 2]
-**** xref:fall2021/19000/19000-f2021-project03.adoc[Project 3]
-**** xref:fall2021/19000/19000-f2021-project04.adoc[Project 4]
-**** xref:fall2021/19000/19000-f2021-project05.adoc[Project 5]
-**** xref:fall2021/19000/19000-f2021-project06.adoc[Project 6]
-**** xref:fall2021/19000/19000-f2021-project07.adoc[Project 7]
-**** xref:fall2021/19000/19000-f2021-project08.adoc[Project 8]
-**** xref:fall2021/19000/19000-f2021-project09.adoc[Project 9]
-**** xref:fall2021/19000/19000-f2021-project10.adoc[Project 10]
-**** xref:fall2021/19000/19000-f2021-project11.adoc[Project 11]
-**** xref:fall2021/19000/19000-f2021-project12.adoc[Project 12]
-**** xref:fall2021/19000/19000-f2021-project13.adoc[Project 13]
-*** xref:fall2021/29000/29000-f2021-projects.adoc[STAT 29000]
-**** xref:fall2021/logistics/29000-f2021-officehours.adoc[Office Hours]
-**** xref:fall2021/29000/29000-f2021-project01.adoc[Project 1]
-**** xref:fall2021/29000/29000-f2021-project02.adoc[Project 2]
-**** xref:fall2021/29000/29000-f2021-project03.adoc[Project 3]
-**** xref:fall2021/29000/29000-f2021-project04.adoc[Project 4]
-**** xref:fall2021/29000/29000-f2021-project05.adoc[Project 5]
-**** xref:fall2021/29000/29000-f2021-project06.adoc[Project 6]
-**** xref:fall2021/29000/29000-f2021-project07.adoc[Project 7]
-**** xref:fall2021/29000/29000-f2021-project08.adoc[Project 8]
-**** xref:fall2021/29000/29000-f2021-project09.adoc[Project 9]
-**** xref:fall2021/29000/29000-f2021-project10.adoc[Project 10]
-**** xref:fall2021/29000/29000-f2021-project11.adoc[Project 11]
-**** xref:fall2021/29000/29000-f2021-project12.adoc[Project 12]
-**** xref:fall2021/29000/29000-f2021-project13.adoc[Project 13]
-*** xref:fall2021/39000/39000-f2021-projects.adoc[STAT 39000]
-**** xref:fall2021/logistics/39000-f2021-officehours.adoc[Office Hours]
-**** xref:fall2021/39000/39000-f2021-project01.adoc[Project 1]
-**** xref:fall2021/39000/39000-f2021-project02.adoc[Project 2]
-**** xref:fall2021/39000/39000-f2021-project03.adoc[Project 3]
-**** xref:fall2021/39000/39000-f2021-project04.adoc[Project 4]
-**** xref:fall2021/39000/39000-f2021-project05.adoc[Project 5]
-**** xref:fall2021/39000/39000-f2021-project06.adoc[Project 6]
-**** xref:fall2021/39000/39000-f2021-project07.adoc[Project 7]
-**** xref:fall2021/39000/39000-f2021-project08.adoc[Project 8]
-**** xref:fall2021/39000/39000-f2021-project09.adoc[Project 9]
-**** xref:fall2021/39000/39000-f2021-project10.adoc[Project 10]
-**** xref:fall2021/39000/39000-f2021-project11.adoc[Project 11]
-**** xref:fall2021/39000/39000-f2021-project12.adoc[Project 12]
-**** xref:fall2021/39000/39000-f2021-project13.adoc[Project 13]
-** Spring 2022
-*** xref:spring2022/19000/19000-s2022-projects.adoc[STAT 19000]
-**** xref:spring2022/19000/19000-s2022-project01.adoc[Project 1]
-**** xref:spring2022/19000/19000-s2022-project02.adoc[Project 2]
-**** xref:spring2022/19000/19000-s2022-project03.adoc[Project 3]
-**** xref:spring2022/19000/19000-s2022-project04.adoc[Project 4]
-**** xref:spring2022/19000/19000-s2022-project05.adoc[Project 5]
-**** xref:spring2022/19000/19000-s2022-project06.adoc[Project 6]
-**** xref:spring2022/19000/19000-s2022-project07.adoc[Project 7]
-**** xref:spring2022/19000/19000-s2022-project08.adoc[Project 8]
-**** xref:spring2022/19000/19000-s2022-project09.adoc[Project 9]
-**** xref:spring2022/19000/19000-s2022-project10.adoc[Project 10]
-**** xref:spring2022/19000/19000-s2022-project11.adoc[Project 11]
-**** xref:spring2022/19000/19000-s2022-project12.adoc[Project 12]
-**** xref:spring2022/19000/19000-s2022-project13.adoc[Project 13]
-**** xref:spring2022/19000/19000-s2022-project14.adoc[Project 14]
-*** xref:spring2022/29000/29000-s2022-projects.adoc[STAT 29000]
-**** xref:spring2022/29000/29000-s2022-project01.adoc[Project 1]
-**** xref:spring2022/29000/29000-s2022-project02.adoc[Project 2]
-**** xref:spring2022/29000/29000-s2022-project03.adoc[Project 3]
-**** xref:spring2022/29000/29000-s2022-project04.adoc[Project 4]
-**** xref:spring2022/29000/29000-s2022-project05.adoc[Project 5]
-**** xref:spring2022/29000/29000-s2022-project06.adoc[Project 6]
-**** xref:spring2022/29000/29000-s2022-project07.adoc[Project 7]
-**** xref:spring2022/29000/29000-s2022-project08.adoc[Project 8]
-**** xref:spring2022/29000/29000-s2022-project09.adoc[Project 9]
-**** xref:spring2022/29000/29000-s2022-project10.adoc[Project 10]
-**** xref:spring2022/29000/29000-s2022-project11.adoc[Project 11]
-**** xref:spring2022/29000/29000-s2022-project12.adoc[Project 12]
-**** xref:spring2022/29000/29000-s2022-project13.adoc[Project 13]
-**** xref:spring2022/29000/29000-s2022-project14.adoc[Project 14]
-*** xref:spring2022/39000/39000-s2022-projects.adoc[STAT 39000]
-**** xref:spring2022/39000/39000-s2022-project01.adoc[Project 1]
-**** xref:spring2022/39000/39000-s2022-project02.adoc[Project 2]
-**** xref:spring2022/39000/39000-s2022-project03.adoc[Project 3]
-**** xref:spring2022/39000/39000-s2022-project04.adoc[Project 4]
-**** xref:spring2022/39000/39000-s2022-project05.adoc[Project 5]
-**** xref:spring2022/39000/39000-s2022-project06.adoc[Project 6]
-**** xref:spring2022/39000/39000-s2022-project07.adoc[Project 7]
-**** xref:spring2022/39000/39000-s2022-project08.adoc[Project 8]
-**** xref:spring2022/39000/39000-s2022-project09.adoc[Project 9]
-**** xref:spring2022/39000/39000-s2022-project10.adoc[Project 10]
-**** xref:spring2022/39000/39000-s2022-project11.adoc[Project 11]
-**** xref:spring2022/39000/39000-s2022-project12.adoc[Project 12]
-**** xref:spring2022/39000/39000-s2022-project13.adoc[Project 13]
-**** xref:spring2022/39000/39000-s2022-project14.adoc[Project 14]
-** Fall 2022
-*** xref:fall2022/10100/10100-2022-projects.adoc[TDM 101]
-**** xref:fall2022/logistics/10100-2022-officehours.adoc[Office Hours]
-**** xref:fall2022/10100/10100-2022-project01.adoc[Project 1]
-**** xref:fall2022/10100/10100-2022-project02.adoc[Project 2]
-**** xref:fall2022/10100/10100-2022-project03.adoc[Project 3]
-**** xref:fall2022/10100/10100-2022-project04.adoc[Project 4]
-**** xref:fall2022/10100/10100-2022-project05.adoc[Project 5]
-**** xref:fall2022/10100/10100-2022-project06.adoc[Project 6]
-**** xref:fall2022/10100/10100-2022-project07.adoc[Project 7]
-**** xref:fall2022/10100/10100-2022-project08.adoc[Project 8]
-**** xref:fall2022/10100/10100-2022-project09.adoc[Project 9]
-**** xref:fall2022/10100/10100-2022-project10.adoc[Project 10]
-**** xref:fall2022/10100/10100-2022-project11.adoc[Project 11]
-**** xref:fall2022/10100/10100-2022-project12.adoc[Project 12]
-**** xref:fall2022/10100/10100-2022-project13.adoc[Project 13]
-*** xref:fall2022/20100/20100-2022-projects.adoc[TDM 201]
-**** xref:fall2022/logistics/20100-2022-officehours.adoc[Office Hours]
-**** xref:fall2022/20100/20100-2022-project01.adoc[Project 1]
-**** xref:fall2022/20100/20100-2022-project02.adoc[Project 2]
-**** xref:fall2022/20100/20100-2022-project03.adoc[Project 3]
-**** xref:fall2022/20100/20100-2022-project04.adoc[Project 4]
-**** xref:fall2022/20100/20100-2022-project05.adoc[Project 5]
-**** xref:fall2022/20100/20100-2022-project06.adoc[Project 6]
-**** xref:fall2022/20100/20100-2022-project07.adoc[Project 7]
-**** xref:fall2022/20100/20100-2022-project08.adoc[Project 8]
-**** xref:fall2022/20100/20100-2022-project09.adoc[Project 9]
-**** xref:fall2022/20100/20100-2022-project10.adoc[Project 10]
-**** xref:fall2022/20100/20100-2022-project11.adoc[Project 11]
-**** xref:fall2022/20100/20100-2022-project12.adoc[Project 12]
-**** xref:fall2022/20100/20100-2022-project13.adoc[Project 13]
-*** xref:fall2022/30100/30100-2022-projects.adoc[TDM 301]
-**** xref:fall2022/logistics/30100-2022-officehours.adoc[Office Hours]
-**** xref:fall2022/30100/30100-2022-project01.adoc[Project 1]
-**** xref:fall2022/30100/30100-2022-project02.adoc[Project 2]
-**** xref:fall2022/30100/30100-2022-project03.adoc[Project 3]
-**** xref:fall2022/30100/30100-2022-project04.adoc[Project 4]
-**** xref:fall2022/30100/30100-2022-project05.adoc[Project 5]
-**** xref:fall2022/30100/30100-2022-project06.adoc[Project 6]
-**** xref:fall2022/30100/30100-2022-project07.adoc[Project 7]
-**** xref:fall2022/30100/30100-2022-project08.adoc[Project 8]
-**** xref:fall2022/30100/30100-2022-project09.adoc[Project 9]
-**** xref:fall2022/30100/30100-2022-project10.adoc[Project 10]
-**** xref:fall2022/30100/30100-2022-project11.adoc[Project 11]
-**** xref:fall2022/30100/30100-2022-project12.adoc[Project 12]
-**** xref:fall2022/30100/30100-2022-project13.adoc[Project 13]
-*** xref:fall2022/40100/40100-2022-projects.adoc[TDM 401]
-**** xref:fall2022/logistics/40100-2022-officehours.adoc[Office Hours]
-**** xref:fall2022/40100/40100-2022-project01.adoc[Project 1]
-**** xref:fall2022/40100/40100-2022-project02.adoc[Project 2]
-**** xref:fall2022/40100/40100-2022-project03.adoc[Project 3]
-**** xref:fall2022/40100/40100-2022-project04.adoc[Project 4]
-**** xref:fall2022/40100/40100-2022-project05.adoc[Project 5]
-**** xref:fall2022/40100/40100-2022-project06.adoc[Project 6]
-**** xref:fall2022/40100/40100-2022-project07.adoc[Project 7]
-**** xref:fall2022/40100/40100-2022-project08.adoc[Project 8]
-**** xref:fall2022/40100/40100-2022-project09.adoc[Project 9]
-**** xref:fall2022/40100/40100-2022-project10.adoc[Project 10]
-**** xref:fall2022/40100/40100-2022-project11.adoc[Project 11]
-**** xref:fall2022/40100/40100-2022-project12.adoc[Project 12]
-**** xref:fall2022/40100/40100-2022-project13.adoc[Project 13]
-** Spring 2023
-*** xref:spring2023/10200/10200-2023-projects.adoc[TDM 102]
-**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours]
-**** xref:spring2023/10200/10200-2023-project01.adoc[Project 1]
-**** xref:spring2023/10200/10200-2023-project02.adoc[Project 2]
-**** xref:spring2023/10200/10200-2023-project03.adoc[Project 3]
-**** xref:spring2023/10200/10200-2023-project04.adoc[Project 4]
-**** xref:spring2023/10200/10200-2023-project05.adoc[Project 5]
-**** xref:spring2023/10200/10200-2023-project06.adoc[Project 6]
-**** xref:spring2023/10200/10200-2023-project07.adoc[Project 7]
-**** xref:spring2023/10200/10200-2023-project08.adoc[Project 8]
-**** xref:spring2023/10200/10200-2023-project09.adoc[Project 9]
-**** xref:spring2023/10200/10200-2023-project10.adoc[Project 10]
-**** xref:spring2023/10200/10200-2023-project11.adoc[Project 11]
-**** xref:spring2023/10200/10200-2023-project12.adoc[Project 12]
-**** xref:spring2023/10200/10200-2023-project13.adoc[Project 13]
-*** xref:spring2023/20200/20200-2023-projects.adoc[TDM 202]
-**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours]
-**** xref:spring2023/20200/20200-2023-project01.adoc[Project 1]
-**** xref:spring2023/20200/20200-2023-project02.adoc[Project 2]
-**** xref:spring2023/20200/20200-2023-project03.adoc[Project 3]
-**** xref:spring2023/20200/20200-2023-project04.adoc[Project 4]
-**** xref:spring2023/20200/20200-2023-project05.adoc[Project 5]
-**** xref:spring2023/20200/20200-2023-project06.adoc[Project 6]
-**** xref:spring2023/20200/20200-2023-project07.adoc[Project 7]
-**** xref:spring2023/20200/20200-2023-project08.adoc[Project 8]
-**** xref:spring2023/20200/20200-2023-project09.adoc[Project 9]
-**** xref:spring2023/20200/20200-2023-project10.adoc[Project 10]
-**** xref:spring2023/20200/20200-2023-project11.adoc[Project 11]
-**** xref:spring2023/20200/20200-2023-project12.adoc[Project 12]
-**** xref:spring2023/20200/20200-2023-project13.adoc[Project 13]
-*** xref:spring2023/30200/30200-2023-projects.adoc[TDM 302]
-**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours]
-**** xref:spring2023/30200/30200-2023-project01.adoc[Project 1]
-**** xref:spring2023/30200/30200-2023-project02.adoc[Project 2]
-**** xref:spring2023/30200/30200-2023-project03.adoc[Project 3]
-**** xref:spring2023/30200/30200-2023-project04.adoc[Project 4]
-**** xref:spring2023/30200/30200-2023-project05.adoc[Project 5]
-**** xref:spring2023/30200/30200-2023-project06.adoc[Project 6]
-**** xref:spring2023/30200/30200-2023-project07.adoc[Project 7]
-**** xref:spring2023/30200/30200-2023-project08.adoc[Project 8]
-**** xref:spring2023/30200/30200-2023-project09.adoc[Project 9]
-**** xref:spring2023/30200/30200-2023-project10.adoc[Project 10]
-**** xref:spring2023/30200/30200-2023-project11.adoc[Project 11]
-**** xref:spring2023/30200/30200-2023-project12.adoc[Project 12]
-**** xref:spring2023/30200/30200-2023-project13.adoc[Project 13]
-*** xref:spring2023/40200/40200-2023-projects.adoc[TDM 402]
-**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours]
-**** xref:spring2023/40200/40200-2023-project01.adoc[Project 1]
-**** xref:spring2023/40200/40200-2023-project02.adoc[Project 2]
-**** xref:spring2023/40200/40200-2023-project03.adoc[Project 3]
-**** xref:spring2023/40200/40200-2023-project04.adoc[Project 4]
-**** xref:spring2023/40200/40200-2023-project05.adoc[Project 5]
-**** xref:spring2023/40200/40200-2023-project06.adoc[Project 6]
-**** xref:spring2023/40200/40200-2023-project07.adoc[Project 7]
-**** xref:spring2023/40200/40200-2023-project08.adoc[Project 8]
-**** xref:spring2023/40200/40200-2023-project09.adoc[Project 9]
-**** xref:spring2023/40200/40200-2023-project10.adoc[Project 10]
-**** xref:spring2023/40200/40200-2023-project11.adoc[Project 11]
-**** xref:spring2023/40200/40200-2023-project12.adoc[Project 12]
-**** xref:spring2023/40200/40200-2023-project13.adoc[Project 13]
-** Fall 2023
-*** xref:fall2023/10100/10100-2023-projects.adoc[TDM 101]
-**** xref:fall2023/logistics/office_hours_101.adoc[Office Hours]
-**** xref:fall2023/10100/10100-2023-project01.adoc[Project 1]
-**** xref:fall2023/10100/10100-2023-project02.adoc[Project 2]
-**** xref:fall2023/10100/10100-2023-project03.adoc[Project 3]
-**** xref:fall2023/10100/10100-2023-project04.adoc[Project 4]
-**** xref:fall2023/10100/10100-2023-project05.adoc[Project 5]
-**** xref:fall2023/10100/10100-2023-project06.adoc[Project 6]
-**** xref:fall2023/10100/10100-2023-project07.adoc[Project 7]
-**** xref:fall2023/10100/10100-2023-project08.adoc[Project 8]
-**** xref:fall2023/10100/10100-2023-project09.adoc[Project 9]
-**** xref:fall2023/10100/10100-2023-project10.adoc[Project 10]
-**** xref:fall2023/10100/10100-2023-project11.adoc[Project 11]
-**** xref:fall2023/10100/10100-2023-project12.adoc[Project 12]
-**** xref:fall2023/10100/10100-2023-project13.adoc[Project 13]
-*** xref:fall2023/20100/20100-2023-projects.adoc[TDM 201]
-**** xref:fall2023/logistics/office_hours_201.adoc[Office Hours]
-**** xref:fall2023/20100/20100-2023-project01.adoc[Project 1]
-**** xref:fall2023/20100/20100-2023-project02.adoc[Project 2]
-**** xref:fall2023/20100/20100-2023-project03.adoc[Project 3]
-**** xref:fall2023/20100/20100-2023-project04.adoc[Project 4]
-**** xref:fall2023/20100/20100-2023-project05.adoc[Project 5]
-**** xref:fall2023/20100/20100-2023-project06.adoc[Project 6]
-**** xref:fall2023/20100/20100-2023-project07.adoc[Project 7]
-**** xref:fall2023/20100/20100-2023-project08.adoc[Project 8]
-**** xref:fall2023/20100/20100-2023-project09.adoc[Project 9]
-**** xref:fall2023/20100/20100-2023-project10.adoc[Project 10]
-**** xref:fall2023/20100/20100-2023-project11.adoc[Project 11]
-**** xref:fall2023/20100/20100-2023-project12.adoc[Project 12]
-**** xref:fall2023/20100/20100-2023-project13.adoc[Project 13]
-*** xref:fall2023/30100/30100-2023-projects.adoc[TDM 301]
-**** xref:fall2023/logistics/office_hours_301.adoc[Office Hours]
-**** xref:fall2023/30100/30100-2023-project01.adoc[Project 1]
-**** xref:fall2023/30100/30100-2023-project02.adoc[Project 2]
-**** xref:fall2023/30100/30100-2023-project03.adoc[Project 3]
-**** xref:fall2023/30100/30100-2023-project04.adoc[Project 4]
-**** xref:fall2023/30100/30100-2023-project05.adoc[Project 5]
-**** xref:fall2023/30100/30100-2023-project06.adoc[Project 6]
-**** xref:fall2023/30100/30100-2023-project07.adoc[Project 7]
-**** xref:fall2023/30100/30100-2023-project08.adoc[Project 8]
-**** xref:fall2023/30100/30100-2023-project09.adoc[Project 9]
-**** xref:fall2023/30100/30100-2023-project10.adoc[Project 10]
-**** xref:fall2023/30100/30100-2023-project11.adoc[Project 11]
-**** xref:fall2023/30100/30100-2023-project12.adoc[Project 12]
-**** xref:fall2023/30100/30100-2023-project13.adoc[Project 13]
-*** xref:fall2023/40100/40100-2023-projects.adoc[TDM 401]
-**** xref:fall2023/logistics/office_hours_401.adoc[Office Hours]
-**** xref:fall2023/40100/40100-2023-project01.adoc[Project 1]
-**** xref:fall2023/40100/40100-2023-project02.adoc[Project 2]
-**** xref:fall2023/40100/40100-2023-project03.adoc[Project 3]
-**** xref:fall2023/40100/40100-2023-project04.adoc[Project 4]
-**** xref:fall2023/40100/40100-2023-project05.adoc[Project 5]
-**** xref:fall2023/40100/40100-2023-project06.adoc[Project 6]
-**** xref:fall2023/40100/40100-2023-project07.adoc[Project 7]
-**** xref:fall2023/40100/40100-2023-project08.adoc[Project 8]
-**** xref:fall2023/40100/40100-2023-project09.adoc[Project 9]
-**** xref:fall2023/40100/40100-2023-project10.adoc[Project 10]
-**** xref:fall2023/40100/40100-2023-project11.adoc[Project 11]
-**** xref:fall2023/40100/40100-2023-project12.adoc[Project 12]
-**** xref:fall2023/40100/40100-2023-project13.adoc[Project 13]
-** Spring 2024
-*** xref:spring2024/10200/10200-2024-projects.adoc[TDM 10200]
-**** xref:spring2024/10200/10200-2024-project01.adoc[Project 1]
-**** xref:spring2024/10200/10200-2024-project02.adoc[Project 2]
-**** xref:spring2024/10200/10200-2024-project03.adoc[Project 3]
-**** xref:spring2024/10200/10200-2024-project04.adoc[Project 4]
-**** xref:spring2024/10200/10200-2024-project05.adoc[Project 5]
-**** xref:spring2024/10200/10200-2024-project06.adoc[Project 6]
-**** xref:spring2024/10200/10200-2024-project07.adoc[Project 7]
-**** xref:spring2024/10200/10200-2024-project08.adoc[Project 8]
-**** xref:spring2024/10200/10200-2024-project09.adoc[Project 9]
-**** xref:spring2024/10200/10200-2024-project10.adoc[Project 10]
-**** xref:spring2024/10200/10200-2024-project11.adoc[Project 11]
-**** xref:spring2024/10200/10200-2024-project12.adoc[Project 12]
-**** xref:spring2024/10200/10200-2024-project13.adoc[Project 13]
-**** xref:spring2024/10200/10200-2024-project14.adoc[Project 14]
-*** xref:spring2024/20200/20200-2024-projects.adoc[TDM 20200]
-**** xref:spring2024/20200/20200-2024-project01.adoc[Project 1]
-**** xref:spring2024/20200/20200-2024-project02.adoc[Project 2]
-**** xref:spring2024/20200/20200-2024-project03.adoc[Project 3]
-**** xref:spring2024/20200/20200-2024-project04.adoc[Project 4]
-**** xref:spring2024/20200/20200-2024-project05.adoc[Project 5]
-**** xref:spring2024/20200/20200-2024-project06.adoc[Project 6]
-**** xref:spring2024/20200/20200-2024-project07.adoc[Project 7]
-**** xref:spring2024/20200/20200-2024-project08.adoc[Project 8]
-**** xref:spring2024/20200/20200-2024-project09.adoc[Project 9]
-**** xref:spring2024/20200/20200-2024-project10.adoc[Project 10]
-**** xref:spring2024/20200/20200-2024-project11.adoc[Project 11]
-**** xref:spring2024/20200/20200-2024-project12.adoc[Project 12]
-**** xref:spring2024/20200/20200-2024-project13.adoc[Project 13]
-**** xref:spring2024/20200/20200-2024-project14.adoc[Project 14]
-*** xref:spring2024/30200_40200/30200-2024-projects.adoc[TDM 30200]
-*** xref:spring2024/30200_40200/40200-2024-projects.adoc[TDM 40200]
-** Think Summer 2024
-*** xref:summer2024/summer-2024-account-setup.adoc[Account Setup]
-*** xref:summer2024/summer-2024-project-template.adoc[Project Template]
-*** xref:summer2024/summer-2024-project-introduction.adoc[Introduction]
-*** xref:summer2024/summer-2024-day1-notes.adoc[Day 1 Notes]
-*** xref:summer2024/summer-2024-day2-notes.adoc[Day 2 Notes]
-*** xref:summer2024/summer-2024-day3-notes.adoc[Day 3 Notes]
-*** xref:summer2024/summer-2024-day4-notes.adoc[Day 4 Notes]
-*** xref:summer2024/summer-2024-day5-notes.adoc[Day 5 Notes]
-*** xref:summer2024/summer-2024-project-01.adoc[Project 1]
-*** xref:summer2024/summer-2024-project-02.adoc[Project 2]
-*** xref:summer2024/summer-2024-project-03.adoc[Project 3]
-*** xref:summer2024/summer-2024-project-04.adoc[Project 4]
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2018/tdm201819projects.adoc b/projects-appendix/modules/ROOT/pages/fall2018/tdm201819projects.adoc
deleted file mode 100644
index 4424f189e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2018/tdm201819projects.adoc
+++ /dev/null
@@ -1,2269 +0,0 @@
-= TDM Fall 2018 STAT 19000 Projects
-
-== Project 1
-
-Question 1.
-
-Use the airline data stored in this directory:
-
-`/depot/statclass/data/dataexpo2009`
-
-In the year 2005, find:
-
-a. the number of flights that occurred on each day of the year, and
-
-b. the day of the year on which the most flights occurred.
-
-Solution:
-
-We switch to the directory for the airline data
-
-`cd /depot/statclass/data/dataexpo2009`
-
-a. The number of flights that occurred on each day of the year can be obtained by extracting the 1st, 2nd, and 3rd fields, sorting the data, and then summarizing it using the uniq command with the -c flag
-
-`sort 2005.csv | cut -d, -f1-3 | sort | uniq -c`
-
-The first few lines of the output are:
-
-[source,bash]
-----
-16477 2005,10,1
-19885 2005,10,10
-19515 2005,10,11
-19701 2005,10,12
-19883 2005,10,13
-----
-
-and the last few lines of the output are:
-
-[source,bash]
-----
-20051 2005,9,6
-19629 2005,9,7
-19968 2005,9,8
-19938 2005,9,9
- 1 Year,Month,DayofMonth
-----
-
-b. The day of the year on which the most flights occurred can be found by sorting the results above in numerical order using sort -n; then, optionally, we can extract the last line of the output using tail -n1
-
-`sort 2005.csv | cut -d, -f1-3 | sort | uniq -c | sort -n | tail -n1`
-
-and we conclude that the most flights occurred on August 5, 2005:
-
-`21041 2005,8,5`
-
-
-Question 2.
-
-Again considering the year 2005, did United or Delta have more flights?
-
-Solution:
-
-We can extract the 9th field, which is the carrier (i.e., the airline company), then sort the data and summarize it using uniq -c
-
-This yields the number of flights for each carrier. We can either read off the number of United or Delta flights with our eyeballs, or we can use the grep command, searching for the patterns UA and DL to isolate (only) the number of flights for United and Delta, respectively.
-
-`sort 2005.csv | cut -d, -f9 | sort | uniq -c | grep "UA\|DL"`
-
-The output is
-
-[source,bash]
-----
-658302 DL
-485918 UA
-----
-
-so Delta has more flights than United in 2005.
-
-
-Question 3.
-
-Consider the June 2017 taxi cab data, which is located in this folder:
-
-`/depot/statclass/data/taxi2018`
-
-What is the distribution of the number of passengers in the taxi cab rides? In other words, make a list of the number of rides that have 1 passenger; that have 2 passengers; etc.
-
-Solution:
-
-Now we change directories to consider the taxi cab data
-
-`cd ../taxi2018`
-
-The ".." in the previous command just indicates that we want to go up one level to
-
-`/depot/statclass/data`
-
-and then, from that point, we want to go into the taxi cab directory. If this sounds complicated, then (instead) it is safe to use the longer version:
-
-`cd /depot/statclass/data/taxi2018`
-
-The number of passengers is given in the 4th column, `passenger_count`
-
-We use a method similar to the one from the previous questions: we extract the 4th column, sort the data, and then summarize it using the uniq command with the -c flag
-
-`sort yellow_tripdata_2017-06.csv | cut -d, -f4 | sort | uniq -c`
-
-and the distribution of the number of passengers is:
-
-[source,bash]
-----
- 1
- 548 0
-6933189 1
-1385066 2
- 406162 3
- 187979 4
- 455753 5
- 288220 6
- 26 7
- 30 8
- 20 9
- 1 passenger_count
-----
-
-Notice that we have some extraneous information, i.e., there is one blank line and also one line for the passenger_count (from the header)
-
-
-== Project 2
-
-Question 1.
-
-Use the airline data stored in this directory:
-
-`/depot/statclass/data/dataexpo2009`
-
-a. What was the average arrival delay (in minutes) for flights in 2005?
-
-b. What was the average departure delay (in minutes) for flights in 2005?
-
-c. and d. Now revise your solutions to 1a and 1b, to account for the delays (of both types) in the full set of data, across all years.
-
-
-Question 2.
-
-Revise your solutions to 1abcd to only include flights that took place on the weekends.
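-
-A minimal R sketch of the kind of calculation involved (question 1a, and then the weekend restriction), assuming the 2005 file is read with read.csv and that, as in the Data Expo 2009 documentation, the relevant columns are named ArrDelay and DayOfWeek, with DayOfWeek coded 1 (Monday) through 7 (Sunday):
-
-[source,r]
-----
-myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")
-
-# average arrival delay in minutes, ignoring missing values
-mean(myDF$ArrDelay, na.rm=TRUE)
-
-# the same average, restricted to weekend flights (Saturday = 6, Sunday = 7)
-weekendDF <- myDF[myDF$DayOfWeek %in% c(6, 7), ]
-mean(weekendDF$ArrDelay, na.rm=TRUE)
-----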
-
-Question 3.
-
-Consider the June 2017 taxi cab data, which is located in this folder:
-
-`/depot/statclass/data/taxi2018`
-
-What is the average distance of a taxi cab ride in New York City in June 2017?
-
-
-== Project 3
-
-Use R to revisit these questions. They can each be accomplished with 1 line of code.
-
-Question 1.
-
-As in Project 1, question 2: In the year 2005, did United or Delta have more flights?
-
-Question 2.
-
-As in Project 2, question 2a: Restricting attention to weekends (only), what was the average arrival delay (in minutes) for flights in 2005?
-
-Question 3.
-
-As in Project 1, question 3: In June 2017, what is the distribution of the number of passengers in the taxi cab rides?
-
-Question 4.
-
-As in Project 2, question 3: What is the average distance of a taxi cab ride in New York City in June 2017?
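-
-For instance, a minimal sketch for Question 1, assuming the 2005 file is read with read.csv (so the header supplies the column names) and that the carrier column is named UniqueCarrier, as in the Data Expo 2009 header; once the file is read in, the counting itself is one line:
-
-[source,r]
-----
-myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")
-
-# number of flights per carrier in 2005; UA is United, DL is Delta
-table(myDF$UniqueCarrier)[c("UA", "DL")]
-----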
-
-
-
-
-== Project 4
-
-Revisit the map code on the STAT 19000 webpage:
-
-http://www.stat.purdue.edu/datamine/19000/
-
-Goal: Make a map of the State of Indiana, which shows all of Indiana's airports.
-
-Notes:
-
-You will need to install the ggmap package, which takes a few minutes to install.
-
-You can read in the data about the airports from the Data Expo 2009 Supplementary Data:
-
-http://stat-computing.org/dataexpo/2009/supplemental-data.html
-
-It will be necessary to extract (only) the airports with "state" equal to "IN"
-
-It is possible to either dynamically load the longitude and latitude of Indianapolis from Google, or to manually specify them (e.g., by looking them up yourself in Google and entering them).
-
-After you plot the State of Indiana with all of the airports shown, you can print the resulting plot to a pdf file as follows:
-
-`dev.print(pdf, "filename.pdf")`
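-
-Here is a rough sketch of one possible approach (not necessarily the same as the map code on the webpage), assuming the supplemental airports file has been saved as airports.csv with columns named state, lat, and long, and assuming ggmap's get_map can fetch a map centered on Indianapolis (newer versions of ggmap may require registering an API key):
-
-[source,r]
-----
-library(ggmap)    # also attaches ggplot2
-
-airports <- read.csv("airports.csv")             # supplemental airports data
-indiana  <- airports[airports$state == "IN", ]   # keep only the Indiana airports
-
-# center the map (approximately) on Indianapolis; coordinates entered manually
-indymap <- get_map(location = c(lon = -86.16, lat = 39.77), zoom = 7)
-
-print(ggmap(indymap) + geom_point(data = indiana, aes(x = long, y = lat)))
-
-dev.print(pdf, "indiana_airports.pdf")
-----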
-
-Please submit your GitHub code in a ".R" file and also the resulting ".pdf" file.
-
-It is not (yet) necessary to submit your work in RMarkdown.
-
-
-
-== Project 5
-
-Question 1.
-
-a. Compute the average distance for the flights on each airline in 2005.
-
-b. Sort the result from 1a, and make a dotchart to display the results in sorted order. (Please display all of the values in the dotchart.)
-
-Hint: You can use `?dotchart` if you want to read more about how to make a dotchart of the data.
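-
-A minimal sketch for question 1, assuming the 2005 file is read with read.csv and that the distance and carrier columns are named Distance and UniqueCarrier, as in the Data Expo 2009 header:
-
-[source,r]
-----
-myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")
-
-# average flight distance for each carrier in 2005
-avgdist <- tapply(myDF$Distance, myDF$UniqueCarrier, mean, na.rm=TRUE)
-
-# sort the averages and display all of them in a dotchart
-dotchart(sort(avgdist))
-----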
-
-
-Question 2.
-
-a. Compute the average total amount of the cost of taxi rides in June 2017, for each pickup location ID. You can see which variables have the total amount of the cost of the ride, as well as the pickup location ID, if you look at the data dictionary for the yellow taxi cab rides, which you can download here: `http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml`
-
-b. Sort the result from 2a, and make a dotchart to display the results in sorted order. (Please ONLY display the results with value bigger than 80.)
-
-Question 3.
-
-Put the two questions above -- including your comments -- into an RMarkdown file. Submit the .Rmd file itself and either the html or pdf output, when you submit your project in GitHub.
-
-
-
-== Project 6
-
-Consider the election donation data:
-
-https://www.fec.gov/data/advanced/?tab=bulk-data
-
-from "Contributions by individuals" for 2017-18. Download this data.
-
-
-Unzip the file (in the terminal).
-
-Use the cat command to concatenate all of the files in the by_date folder into one large file (in the terminal).
-
-Read the data dictionary:
-
-https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/
-
-
-Hint: When working with a file that is not comma separated, you can use the read.delim command in R, and *be sure to specify* the character that separates the various pieces of data on a row.
-To do this, you can read the help file for read.delim by typing: ?read.delim
-(Look for the "field separator character".)
-
-Also, there is no header row, so use header=F.
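-
-For example, a minimal sketch, assuming the concatenated file from the by_date folder was saved under the (hypothetical) name all_by_date.txt, and assuming the contributions file is pipe-delimited, as described in the FEC data dictionary:
-
-[source,r]
-----
-# "all_by_date.txt" is a placeholder name for the concatenated file
-# the field separator is "|" and there is no header row
-myDF <- read.delim("all_by_date.txt", sep="|", header=F)
-
-# match the columns (V1, V2, ...) against the data dictionary before using them
-head(myDF)
-----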
-
-
-Question 1.
-
-Rank the states according to how many times their citizens contributed (i.e., the total number of donations). Which 5 states made the largest numbers of contributions?
-
-Question 2.
-
-Use awk in the terminal to verify your solution to question 1.
-
-Question 3.
-
-Now (instead) rank the states according to how much money their citizens contributed (i.e., total amount of donations). Which 5 states contributed the largest amount of money?
-
-(Optional!!) challenge question: Use awk in the terminal to verify your solution to question 3.
-This can be done with 1 line of awk code, but you need to use arrays in awk,
-as demonstrated (for instance) on Andrey's solution on this page:
-
-https://unix.stackexchange.com/questions/242946/using-awk-to-sum-the-values-of-a-column-based-on-the-values-of-another-column/242949
-
-Submit your solutions in RMarkdown. For question 2 (and for the optional challenge question), it is OK to just put your awk code into your comments in RMarkdown, so that the TAs can see how you solved it; the awk code does not (of course) run in RMarkdown, so you are just showing it to the TAs this way.
-
-
-== Project 7
-
-Consider the Lahman baseball database available at:
-http://www.seanlahman.com/baseball-archive/statistics/
-
-Download the 2017 comma-delimited version and unzip it.
-Inside the "core" folder of the unzipped file, you will find many csv files.
-
-If you want to better understand the contents of the files,
-there is a helpful readme file available here:
-http://www.seanlahman.com/files/database/readme2017.txt
-
-Question 1.
-
-Use the Batting.csv file (inside the "core" folder) to discover who is a member of the 40-40 club, namely, who has hit 40 home runs and also has (simultaneously) stolen 40 bases in the same season.
-Hint: There are multiple ways to solve this question. It is not necessary to use a tapply function. This can be done with one line of code.
-
-Question 2.
-
-Make a plot that depicts the total number of home runs per year (across all players on all teams). The plot should have the years as the labels for the x-axis, and should have the number of home runs as the labels for the y-axis.
-Hints: Use the tapply function. Save the results of the tapply function in a vector v. If you do this, then names(v) will hold the list of the years. The plot command has options that include xlab and ylab, so that you can put intelligent labels on the axes; for instance, you can label the x-axis as "years" and the y-axis as "HR".
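-
-A minimal sketch along the lines of that hint, assuming Batting.csv has been read into a data frame and, as in the readme, has columns named yearID and HR:
-
-[source,r]
-----
-myDF <- read.csv("Batting.csv")
-
-# total home runs per year, across all players and all teams
-v <- tapply(myDF$HR, myDF$yearID, sum, na.rm=TRUE)
-
-# years on the x-axis, home run totals on the y-axis
-plot(as.numeric(names(v)), v, xlab="years", ylab="HR")
-----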
-
-Question 3.
-
-a. Try this example: Store the Batting table into a data frame called myBatting. Store the People table into a data frame called myPeople. Merge the two data frames into a new data frame, using the "merge" function: `myDF <- merge(myBatting, myPeople, by="playerID")`
-
-b. Use the paste command to paste the first and last name columns from myDF into a new vector. Save this new vector as a new column in the data frame myDF.
-
-c. Return to question 1, and resolve it. Now we can see the person's full name instead of their playerID.
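-
-For parts b and c, a minimal sketch, assuming (as in the SQL examples later on this page) that the name columns are called nameFirst and nameLast; the new column name fullName is arbitrary:
-
-[source,r]
-----
-# paste the first and last names together and store them as a new column
-myDF$fullName <- paste(myDF$nameFirst, myDF$nameLast)
-
-# question 1 again, now showing full names instead of playerIDs
-myDF$fullName[which(myDF$HR >= 40 & myDF$SB >= 40)]
-----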
-
-
-
-Fun Side Project (to accompany Project 7)
-
-Not required, but fun!
-
-read the `Teams.csv` file into a `data.frame` called myDF
-
-break the data.frame into smaller data frames,
-according to the `teamID`, using this code:
-
-`by(myDF, myDF$teamID, function(x) {plot(x$W)} )`
-
-For each team, this draws 1 plot of the number of wins per year. The number of wins will be on the y-axis of the plots.
-
-For an improved version, we can add the years on the x-axis, as follows:
-
-`by(myDF, myDF$teamID, function(x) {plot(x$year, x$W)} )`
-
-Change your working directory in R to a new folder, using the menu option:
-
-`Session -> Set Working Directory -> Choose Directory`
-
-We are going to make 149 new plots!
-
-After changing the directory, try this code, which makes 149 separate pdf files:
-
-`by(myDF, myDF$teamID, function(x) {pdf(as.character(x$teamID[1])); plot(x$year, x$W); dev.off()} )`
-
-
-== SQL Example 1
-
-We only need to install this package 1 time.
-
-`install.packages("RMySQL")`
-
-No need to run the line above, if you already ran it.
-
-We need to load this library every time we start R.
-
-[source,r]
-----
-library("RMySQL")
-myconnection <- dbConnect(dbDriver("MySQL"),
- host="mydb.ics.purdue.edu",
- username="mdw_guest",
- password="MDW_csp2018",
- dbname="mdw")
-
-easyquery <- function(x) {
- fetch(dbSendQuery(myconnection, x), n=-1)
-}
-----
-
-Here are the players from the Boston Red Sox in the year 2008
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.playerID, b.yearID, b.teamID,
- m.nameFirst, m.nameLast
- FROM Batting b JOIN Master m
- ON b.playerID = m.playerID
- WHERE b.teamID = 'BOS'
- AND b.yearID = 2008;")
-myDF
-----
-
-== SQL Example 2
-
-We only need to install this package 1 time.
-
-`install.packages("RMySQL")`
-
-No need to run the line above, if you already ran it.
-
-We need to load this library every time we start R.
-
-[source,r]
-----
-library("RMySQL")
-myconnection <- dbConnect(dbDriver("MySQL"),
- host="mydb.ics.purdue.edu",
- username="mdw_guest",
- password="MDW_csp2018",
- dbname="mdw")
-
-easyquery <- function(x) {
- fetch(dbSendQuery(myconnection, x), n=-1)
-}
-----
-
-Here is the total number of home runs hit by each player in their entire career
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.nameFirst, m.nameLast,
- b.playerID, SUM(b.HR)
- FROM Batting b JOIN Master m
- ON m.playerID = b.playerID
- GROUP BY b.playerID;")
-
-myDF
-----
-
-Here are the players who hit 600 or more home runs in their careers
-
-`myDF[ myDF$"SUM(b.HR)" >= 600, ]`
-
-== SQL Example 3
-
-We only need to install this package 1 time.
-
-`install.packages("RMySQL")`
-
-No need to run the line above, if you already ran it.
-
-We need to load this library every time we start R.
-
-[source,r]
-----
-library("RMySQL")
-myconnection <- dbConnect(dbDriver("MySQL"),
- host="mydb.ics.purdue.edu",
- username="mdw_guest",
- password="MDW_csp2018",
- dbname="mdw")
-
-easyquery <- function(x) {
- fetch(dbSendQuery(myconnection, x), n=-1)
-}
-----
-
-Here is a basic version for the players who have 60 or more home runs during one season.
-
-[source,r]
-----
-myDF <- easyquery("SELECT b.playerID, b.yearID, b.HR
- FROM Batting b
- WHERE b.HR >= 60;")
-
-myDF
-----
-
-Here is an improved version, which includes the Batting and the Master table, so that we can have the players' full names.
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.nameFirst, m.nameLast,
- b.playerID, b.yearID, b.HR
- FROM Master m JOIN Batting b
- ON m.playerID = b.playerID
- WHERE b.HR >= 60;")
-
-myDF
-----
-
-== SQL Example 4
-
-We only need to install this package 1 time.
-
-`install.packages("RMySQL")`
-
-No need to run the line above, if you already ran it.
-
-We need to load this library every time we start R.
-
-[source,r]
-----
-library("RMySQL")
-myconnection <- dbConnect(dbDriver("MySQL"),
- host="mydb.ics.purdue.edu",
- username="mdw_guest",
- password="MDW_csp2018",
- dbname="mdw")
-
-easyquery <- function(x) {
- fetch(dbSendQuery(myconnection, x), n=-1)
-}
-----
-
-Here is a basic version for the 40-40 club question. (Same question as last week.)
-
-[source,r]
-----
-myDF <- easyquery("SELECT b.playerID, b.yearID, b.SB, b.HR
- FROM Batting b
- WHERE b.SB >= 40 AND b.HR >= 40;")
-
-myDF
-----
-
-Here is an improved version, which includes the Batting and the Master table, so that we can have the players' full names.
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.nameFirst, m.nameLast,
- b.yearID, b.SB, b.HR
- FROM Master m JOIN Batting b
- ON m.playerID = b.playerID
- WHERE b.SB >= 40 AND b.HR >= 40;")
-
-myDF
-----
-
-Here is a further improved version, which includes the Batting, Master, and Teams table, so that we can have the players' full names, and the teams that they played on.
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.nameFirst, m.nameLast,
- b.yearID, b.SB, b.HR, t.name
- FROM Master m JOIN Batting b
- ON m.playerID = b.playerID
- JOIN Teams t
- ON b.yearID = t.yearID
- AND b.teamID = t.teamID
- WHERE b.SB >= 40 AND b.HR >= 40;")
-myDF
-----
-
-
-
-
-== Project 8
-
-Question 1.
-
-Modify SQL Example 2 to find the pitcher who has the most strikeouts in his career.
-
-Hint: You need to use a "Pitching p" table instead of a "Batting b" table.
-
-Hint: The strikeouts are in column "SO" of the Pitching table.
-
-Hint: This pitcher is named "Nolan Ryan"... but you need to use SQL to figure that out.
-
-I am just trying to give you a way to know when you are correct.
-
-Please momentarily forget that I am giving you the answer at the start!
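-
-A minimal sketch of the modification, reusing the easyquery helper from the SQL examples above, and assuming (per the hints) that the Pitching table stores strikeouts in a column named SO:
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.nameFirst, m.nameLast,
-                          p.playerID, SUM(p.SO)
-                   FROM Pitching p JOIN Master m
-                   ON m.playerID = p.playerID
-                   GROUP BY p.playerID;")
-
-# the pitcher with the most career strikeouts
-myDF[which.max(myDF$"SUM(p.SO)"), ]
-----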
-
-Question 2.
-
-Which years was Nolan Ryan a pitcher?
-
-For this project, to make your life easier, it is OK to just submit a regular R file, rather than an RMarkdown file.
-
-
-== Project 9
-
-(Please remember that you have a "ReadMe" file, posted on Piazza last week, which tells you about all of the tables, including the table that tells you where the players went to school.)
-
-1. Find the first and last names of all players who attended Purdue.
-
-2. Find all of the pitchers who have recorded 300 or more strikeouts during a single season.
-
-In the output, give their first and last name and the year in which this achievement occurred.
-(You can just modify Example 3.)
-
-3a. Modify Example 5 to find out which pitchers were able to achieve 300 or more strikeouts AND 20 or more wins during the same season.
-
-3b. Consider the years in which this achievement occurred. Use R to find the list of distinct years in which this achievement occurred at least once.
-
-Background discussion:
-
-If you look at the example for the 40-40 club (in Example 4), it works because each time that a player achieved 40 (or more) HR's and 40 (or more) SB's during the same season, he was only playing for one team. A player never got traded to a new team, in any of those years. Some complications will arise if a player switches teams (i.e., gets traded) during the season. For this reason, we introduce Example 5.
-
-Here are some notes about Example 5:
-
-If we incorporate the SUM function into a condition, for instance `WHERE SUM(b.SB) >= 40`, the query will not work. Instead, if the condition has a `SUM` inside it, we change `WHERE` to `HAVING`. See Example 5 as a perfect example of this. We can also return the results in a given order, using `ORDER BY`; for instance, `ORDER BY b.yearID` if we want to get the results (say) in order by the year.
-
-== SQL Example 5
-
-We only need to install this package 1 time.
-
-`install.packages("RMySQL")`
-
-No need to run the line above, if you already ran it.
-
-We need to load this library every time we start R.
-
-`library("RMySQL")`
-
-[source,r]
-----
-myconnection <- dbConnect(dbDriver("MySQL"),
- host="mydb.ics.purdue.edu",
- username="mdw_guest",
- password="MDW_csp2018",
- dbname="mdw")
-
-easyquery <- function(x) {
- fetch(dbSendQuery(myconnection, x), n=-1)
-}
-----
-
-Here is a basic version for the 30-30 club question. (Same question as last week.)
-
-
-[source,r]
-----
-myDF <- easyquery("SELECT b.playerID, b.yearID, SUM(b.SB), SUM(b.HR)
- FROM Batting b
- GROUP BY b.playerID, b.yearID
- HAVING SUM(b.SB) >= 30 AND SUM(b.HR) >= 30
- ORDER BY b.yearID;")
-myDF
-----
-
-Here is an improved version, which includes the Batting and the Master table,
-so that we can have the players' full names.
-
-
-[source,r]
-----
-myDF <- easyquery("SELECT m.nameFirst, m.nameLast,
- b.yearID, SUM(b.SB), SUM(b.HR)
- FROM Master m JOIN Batting b
- ON m.playerID = b.playerID
- GROUP BY b.playerID, b.yearID
- HAVING SUM(b.SB) >= 30 AND SUM(b.HR) >= 30
- ORDER BY b.yearID;")
-myDF
-----
-
-
-== Project 10
-
-Use the results of the National Park Service scraping example to answer the following two questions:
-
-1. Which states have at least 20 NPS properties?
-
-2. One zip code has 13 properties in the same zip code! What are the names of those 13 properties?
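-
-A minimal sketch of the counting involved, assuming myDF is the data frame built at the end of the case study below (with columns named states, zips, and mynames):
-
-[source,r]
-----
-# number of NPS properties listed for each state; keep the states with 20 or more
-statecounts <- table(myDF$states)
-statecounts[statecounts >= 20]
-
-# number of properties per zip code, and the zip code with the most properties
-zipcounts <- sort(table(myDF$zips))
-tail(zipcounts, n=1)
-
-# names of the properties in that zip code
-myDF$mynames[myDF$zips == names(tail(zipcounts, n=1))]
-----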
-
-If you want to learn XPath (as demonstrated in the case study) to scrape data from a website of your choice, you can make up the grades from 1 or 2 of the previous projects. If you scrape at least 500 pieces of data from the XML of a page, you can replace the grade from 1 previous project. If you scrape at least 1000 pieces of data from the XML of a page, you can replace the grades from 2 previous projects. Your project plan will require written approval from Dr Ward, and it will require you to scrape the data from the XML itself (not just download the data).
-
-Case study: scraping National Park Service data
-
-[source,r]
-----
-# This is a short project to download the data about the
-# properties in the National Park Service (NPS).
-# They are all online through the official NPS webpage:
-# https://www.nps.gov/findapark/index.htm
-# (Please note that some parks extend into more than one state.)
-
-# At the end of the project, when we export the data,
-# we do not want to use comma-separated values (i.e., a csv file)
-# because there are also some commas in our data.
-# So we will use tabs as our delimiter at the end of this process.
-
-# We will use the RCurl package to download the NPS files.
-# Normally we could just parse the XML (or html) content
-# on-the-fly, without downloading the files, but in this case,
-# it wasn't working on about 10 of the files, and somehow
-# when I downloaded the files, it worked completely.
-# I tried this several times, and just going ahead and downloading
-# the files seems to be the most consistent solution.
-install.packages("RCurl")
-library(RCurl)
-
-# We will use the XML package to parse the html (or XML) data
-install.packages("XML")
-library(XML)
-
-# We will use the xlsx package to export the results at the end,
-# into an xlsx file, for viewing in Microsoft Excel, if desired.
-install.packages("xlsx")
-library(xlsx)
-
-# To see the list of the parks, we can go here:
-# https://www.nps.gov/findapark/index.htm
-# in any browser.
-# In most browsers, if you navigate to a page and then type:
-# Control-U (i.e., the Control Key and the letter U Key at once)
-# on a Windows or UNIX machine,
-# or if you type Command-U (i.e., the Command Key and the letter U Key at once)
-# on an Apple Macintosh machine,
-# then you can see the code for the way that the webpage is created.
-
-# This webpage that I mentioned:
-# https://www.nps.gov/findapark/index.htm
-# has 1489 lines of code. Wow.
-
-
-# From (roughly) lines 206 through 756, we see that the
-# data for the parks are wrapped in a "div" (on line 206)
-# and then in a "select" (on line 208)
-# and then in an "optgroup" and then an "option".
-# We want to extract the "value" of each "option".
-# (We skip the "label" on line 205 because it ends on line 205 too.)
-# So we do the following:
-
-myparks <- xpathSApply(htmlParse(getURL("https://www.nps.gov/findapark/index.htm")), "//*/div/select/optgroup/option", xmlGetAttr, "value")
-myparks
-
-# If the line of code (above) doesn't work,
-# then perhaps you forgot to actually run the three "library" commands
-# near the start of the file.
-
-# We did a lot of things with 1 line of code.
-# The "getURL" temporarily downloads all of the code from this webpage.
-# We do not save the webpage, but rather, we send it to the htmlParse command.
-# Once the page is parsed, we send the parsed results to the xpathSApply command.
-# The pattern we want to look for is:
-# "//*/div/select/optgroup/option"
-# The star means that anything is OK before this chunk of the pattern,
-# but we definitely want our pattern to end with /div/select/optgroup/option
-# and then we get the xmlGetAttr attribute called "value"
-# which is one of the parks.
-
-# When we check the results, we got 498 results:
-length(myparks)
-
-# For the Abraham Lincoln Birthplace, we want to run the following command,
-# so that we are prepared to download the webpage.
-# After downloading it, we will extract information from the parsed page:
-system("mkdir ~/Desktop/myparks/")
-download.file("https://www.nps.gov/abli/index.htm", "~/Desktop/myparks/abli.htm")
-htmlParse("~/Desktop/myparks/abli.htm")
-
-# but we want to do that for each park.
-# So we build the following function:
-myparser <- function(x) {
- download.file(paste("https://www.nps.gov/", x, "/index.htm", sep=""), paste("~/Desktop/myparks/", x, ".htm", sep=""))
- htmlParse(paste("~/Desktop/myparks/", x, ".htm", sep=""))
-}
-
-# Now, we apply this function to each element of "myparks"
-# and we save the results in a variable called "mydocs":
-mydocs <- sapply(myparks, myparser)
-
-# The webpage for the Abraham Lincoln Birthplace is now parsed and stored here:
-mydocs[[1]]
-# The webpage for Zion National Park is now parsed and stored here:
-mydocs[[498]]
-
-# Next we look at the source for the Abraham Lincoln Birthplace:
-# https://www.nps.gov/abli/index.htm
-# We load that webpage in any browser and then type:
-# Control-U if we are on a Windows or UNIX machine, or
-# Command-U if we are on a Mac.
-
-# Then we can search in this page (using Control-F on Windows or UNIX,
-# or using Command-F on a Mac) for any pattern we want.
-# If we search for "itemprop"
-# we find the information about the address:
-
-# They are all within a "span" tag, with different "itemprop" attributes:
-# The street address has attribute: "streetAddress"
-# The city has attribute: "addressLocality"
-# The state has attribute: "addressRegion"
-# The zip code has attribute: "postalCode"
-# The telephone has attribute: "telephone"
-
-# So, for instance, we can find all of these as follows:
-xpathSApply(mydocs[[1]], "//*/span[@itemprop='streetAddress']", xmlValue)
-xpathSApply(mydocs[[1]], "//*/span[@itemprop='addressLocality']", xmlValue)
-xpathSApply(mydocs[[1]], "//*/span[@itemprop='addressRegion']", xmlValue)
-xpathSApply(mydocs[[1]], "//*/span[@itemprop='postalCode']", xmlValue)
-xpathSApply(mydocs[[1]], "//*/span[@itemprop='telephone']", xmlValue)
-
-# Then the title stuff:
-
-xpathSApply(mydocs[[1]], "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-xpathSApply(mydocs[[1]], "//*/span[@class='Hero-designation']", xmlValue)
-xpathSApply(mydocs[[1]], "//*/span[@class='Hero-location']", xmlValue)
-
-# and, finally, the social media links:
-
-paste(xpathSApply(mydocs[[1]], "//*/div/ul/li[@class='col-xs-6 col-sm-12 col-md-6']/a", xmlGetAttr, "href"),collapse=",")
-
-
-# Here are the versions for the entire data set:
-
-streets <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='streetAddress']", xmlValue))
-cities <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='addressLocality']", xmlValue))
-states <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='addressRegion']", xmlValue))
-zips <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='postalCode']", xmlValue))
-phones <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='telephone']", xmlValue))
-
-mynames <- sapply(mydocs, function(x) xpathSApply(x, "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue))
-mytypes <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@class='Hero-designation']", xmlValue))
-mylocations <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@class='Hero-location']", xmlValue))
-
-mylinks <- sapply(mydocs, function(x) paste(xpathSApply(x, "//*/div/ul/li[@class='col-xs-6 col-sm-12 col-md-6']/a", xmlGetAttr, "href"),collapse=","))
-
-# with some cleaning up:
-
-streets <- sapply(streets, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-cities <- sapply(cities, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-states <- sapply(states, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-zips <- sapply(zips, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-phones <- sapply(phones, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-mynames <- sapply(mynames, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-mytypes <- sapply(mytypes, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-mylocations <- sapply(mylocations, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-mylinks <- sapply(mylinks, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE)
-
-myDF <- data.frame(
-streets=do.call(rbind,streets),
-cities=do.call(rbind,cities),
-states=do.call(rbind,states),
-zips=do.call(rbind,zips),
-phones=do.call(rbind,phones),
-mynames=do.call(rbind,mynames),
-mytypes=do.call(rbind,mytypes),
-mylocations=do.call(rbind,mylocations),
-mylinks=do.call(rbind,mylinks)
-)
-----
-
-== Project 11
-
-The names in the election data are in CAPITAL LETTERS!
-
-When asking about names in the questions, we assume that you are using the names from the election data, available on Scholar.
-
-You might want to practice on a smaller data set: `/depot/statclass/data/election2018/itsmall.txt`
-
-The full data is available here: `/depot/statclass/data/election2018/itcont.txt`
-
-We are assuming that you are using unique names from column 8, i.e., that you have already removed duplicates of any names of the donors.
-
-Hint: Save column 8 (which contains the donor names) into a new variable. Then extract the unique values from the column using the "unique" command.
-
-Answer these questions using the full data given above. BUT, for convenience, you might want to *start* by using the smaller data set to practice.
-
-Please note that we can read the data into R using the command:
-
-`myDF <- read.csv("/depot/statclass/data/election2018/itsmall.txt", header=F, sep="|")`
-
-or, for the full data set:
-
-`myDF <- read.csv("/depot/statclass/data/election2018/itcont.txt", header=F, sep="|")`
-
-1. Find the number of (unique) donor names that have your first name
-   embedded somewhere in the donor's name (not necessarily as the
-   first or last name--any location is OK).
-
-2. a. How many donors have a consecutive repeated letter in their name? b. How many donors have a consecutive repeated vowel in their name? c. How many donors have a consecutive repeated consonant in their name?
-
-3. Just for fun: Come up with an interesting question about text patterns, and answer it yourself, using regular expressions. Of course you can compare questions and answers with another member of The Data Mine. Have fun!
-
-[source,R]
-----
-
-# Regular expressions enable us to find patterns in text.
-# Here are a handful of examples of regular expressions.
-
-# The best way to learn them in earnest is to just read some documentation about regular expressions and then try them!
-
-# Here is an example:
-v <- c("me", "you", "mark", "laura", "kale", "emma", "err", "eat", "queue", "kangaroo", "kangarooooo", "kangarooooooooo")
-
-# The elements of v that contain the letter "m":
-v[grep("m", v)]
-
-# containing the phrase "me":
-v[grep("me", v)]
-
-# containing the letter "a":
-v[grep("a", v)]
-
-# containing the letter "e":
-v[grep("e", v)]
-
-# containing the letter "k":
-v[grep("k", v)]
-
-# containing the letter "k" at the start of the word:
-v[grep("^k", v)]
-
-# containing the letter "k" at the end of the word:
-v[grep("k$", v)]
-
-# containing the letter "a" at the end of the word:
-v[grep("a$", v)]
-
-# containing the letter "o" at the end of the word:
-v[grep("o$", v)]
-
-# containing the letter "o" anywhere in the word:
-v[grep("o", v)]
-
-# containing the letter "o" two times in a row, anywhere in the word:
-v[grep("o{2}", v)]
-
-# containing the letter "o" three times in a row, anywhere in the word:
-v[grep("o{3}", v)]
-
-# containing the letter "o" two to five times in a row, anywhere in the word:
-v[grep("o{2,5}", v)]
-
-# containing the letter "q" followed by "ue":
-v[grep("q(ue){1}", v)]
-
-# containing the letter "q" followed by "ue" two times:
-v[grep("q(ue){2}", v)]
-
-# containing the letter "q" followed by "ue" three times:
-v[grep("q(ue){3}", v)]
-
-# containing the letter "e" followed by "m" or "r":
-v[grep("e(m|r)", v)]
-
-# again, same idea, but a different way, to find words
-# containing the letter "e" followed by "m" or "r":
-v[grep("e[mr]", v)]
-
-# containing the letter "e" followed by "ma" or "rr":
-v[grep("e(ma|rr)", v)]
-
-# containing a repeated letter:
-v[grep("([a-z])\\1", v)]
-# In this example, the \\1 refers back to whatever was matched
-# by the pattern inside the first set of parentheses.
-
-# Here is a summary of regular expressions:
-# https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
-
-# You are welcome to use any source or reference for regular expressions that you like.
-
-# We need to use a double backslash for back-references, in R.
-# We gave a demonstration of this, in the last example given above.
-# In general, in R, when writing a backslash in a regular expression, a double backslash is usually needed.
-----
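-
-Here is a minimal sketch (not an official solution) of how Questions 1 and 2 above might be approached, combining the `unique` hint with these regular expressions. It uses the smaller practice file, and the example first name is only a placeholder.
-
-[source,R]
-----
-# a sketch using the practice file; for the real answers, swap in the full file itcont.txt
-myDF <- read.csv("/depot/statclass/data/election2018/itsmall.txt", header=F, sep="|")
-donors <- unique(myDF[[8]])    # unique donor names from column 8
-
-# 1. donor names containing a first name (the names are in CAPITAL LETTERS);
-#    "LAURA" is just a placeholder -- use your own first name here
-length(grep("LAURA", donors))
-
-# 2a. consecutive repeated letter, using a back-reference (double backslash in R)
-length(grep("([A-Z])\\1", donors))
-
-# 2b. consecutive repeated vowel
-length(grep("([AEIOU])\\1", donors))
-
-# 2c. consecutive repeated consonant
-length(grep("([B-DF-HJ-NP-TV-Z])\\1", donors))
-----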
-
-
-== Project 12
-
-There was no project 12
-
-== Project 13
-
-There was no project 13
-
-== Project 14
-
-Remind ourselves how to use bash and awk tools (previously we did this in the terminal).
-
-We will do it in Jupyter Notebooks this semester: `http://notebook.scholar.rcac.purdue.edu/`
-
-1. a. Start a new Jupyter Notebook with type "bash" (instead of "R"). We are going to put bash code directly inside the Jupyter Notebook. (In the past, we only wrote bash code directly inside the terminal.) b. Look at the first 10 lines of the 2007 flight data, which is found at: `/depot/statclass/data/dataexpo2009/2007.csv` All of the flights in those first 10 lines are on the same carrier. Which carrier is it? Remember that you can check: `http://stat-computing.org/dataexpo/2009/the-data.html` Now we are going to put awk code directly inside the Jupyter Notebook. (In the past, we only wrote awk code directly inside the terminal.)
-
-2. Save the information about every flight departing from Indianapolis since January 1, 2000 into a common file, named `MyIndyFlights.csv`
-
-Hint 1: You only need the files 2000.csv, 2001.csv, ..., 2008.csv. You can work on all of those files at once, using 2*.csv, because the "*" is a wildcard that matches any pattern.
-
-Hint 2: You can use awk to do this. For comparison, ONLY as an example, we can extract all flights
-that are on Delta airlines in 1998 as follows:
-`cat /depot/statclass/data/dataexpo2009/1998.csv | awk -F, '{ if($9 == "DL") {print $0} }' >MyDeltaFlights.csv`
-
-== Project 14 Solutions
-
-
-[source,bash]
-----
-# 1. The head of the file with the 2007 flights is:
-head /depot/statclass/data/dataexpo2009/2007.csv
-
-# We see that the UniqueCarrier is found in column 9.
-# One way to extract the UniqueCarrier is with the cut command
-# using a comma as the delimiter and retrieving (cut out) the 9th column:
-cut -d, -f9 /depot/statclass/data/dataexpo2009/2007.csv | head -n11
-# We only displayed the head, because we only want the first 10 flights.
-# We specified -n11 because this prints the first 11 lines of the file,
-# namely, the header itself, and the first 10 flights.
-# We can check the data dictionary, available at: http://stat-computing.org/dataexpo/2009/
-# The information about the carrier codes is found there,
-# by clicking on the link for supplemental data sources: http://stat-computing.org/dataexpo/2009/supplemental-data.html
-# and then choosing the carriers file: http://stat-computing.org/dataexpo/2009/carriers.csv
-# The carrier code for each of these first ten flights is "WN", which is Southwest.
-
-# 2. We save the information about the Indianapolis flights by using awk.
-# First we recall how to see the information about all such flights.
-# Here are the first 10 lines of that data.
-cat /depot/statclass/data/dataexpo2009/2*.csv | head
-# Then we change the "head" to the "awk" command.
-# We use comma as the field separator
-# (this is the same as the role of the delimiter from cut)
-# We modify the example from the project assignment,
-# so that we focus on the 17th field (which contains the Origin airport)
-# and we save the resulting data into a file called MyIndyFlights.csv
-
-cat /depot/statclass/data/dataexpo2009/2*.csv | awk -F, '{ if($17 == "IND") {print $0} }' >MyIndyFlights.csv
-# Some of you were not working in your home directory when you ran this command.
-# If you want to be sure to save the file into your home directory,
-# remember that you can explicitly specify your home directory using a tilde, as follows:
-cat /depot/statclass/data/dataexpo2009/2*.csv | awk -F, '{ if($17 == "IND") {print $0} }' >~/MyIndyFlights.csv
-# It is not required that you check things,
-# but if you want to check that things worked properly, you can use the wc command
-# which gives the number of lines, words, and bytes in the resulting file:
-wc MyIndyFlights.csv
-# or, even more explicitly,
-wc ~/MyIndyFlights.csv
-# An alternative is to check the head and the tail:
-head MyIndyFlights.csv
-tail MyIndyFlights.csv
-# or, even more explicitly,
-head ~/MyIndyFlights.csv
-tail ~/MyIndyFlights.csv
-----
-
-== Project 15
-
-Remind ourselves how to use R tools (previously we did this in the terminal). We will do it in Jupyter Notebooks this semester.
-
-Question 1
-
-a. Start a new Jupyter Notebook with type "R"
-
-b. Import the flight data from the file MyIndyFlights.csv into a data frame. You just created this file in Project 14. It contains all of the flights that departed from Indianapolis since January 1, 2000. (There should be 356561 flights altogether, and there is no header.) Hint: When you import the data, if you use the read.csv command, there is no header, so be sure to use header=FALSE.
-
-c. What are the five most popular destinations for travelers who have departed Indianapolis since January 1, 2000? List each of these 5 destinations, and the number of flights to each one.
-
-
-Question 2
-
-a. Consider the year 2005 (only). Tabulate the number of flights per day.
-
-b. On each of the most popular five days, how many flights are there?
-
-c. On each of the least popular five days, how many flights are there?
-
-Hint: You might be surprised to see the wide range of the number of flights per day!
-
-== Project 15 Solutions
-
-
-[source,R]
-----
-
-# 1. We first import the flight data from the file MyIndyFlights.csv
-
-myDF <- read.csv("MyIndyFlights.csv", header=F)
-
-# or, if you prefer to explicitly state that the file
-# is in your home directory, you can add the tilde for your home:
-
-myDF <- read.csv("~/MyIndyFlights.csv", header=F)
-
-# We check that there are 356561 flights altogether:
-
-dim(myDF)
-
-# The five most popular destinations for travelers
-# who depart Indianapolis since January 1, 2000 are:
-tail(sort(table(myDF[[18]])),n=5)
-
-# We used the 18th column, which has the Destination airports.
-# We tabulated the results, using the table command,
-# and then we sorted the results.
-# Finally, at the end, we took the tail of the results,
-# using n=5, since we wanted to see the largest 5 values.
-
-# 2a. We load the 2005 data:
-
-myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")
-
-# To get the number of flights per day,
-# we can first paste together the Month and Day columns.
-# We check the head, to make sure that this worked:
-
-head(paste(myDF$Month, myDF$DayofMonth))
-
-# It is also possible, for instance, to separate the
-# month and the day by separators, such as a slash:
-
-head(paste(myDF$Month, myDF$DayofMonth, sep="/"))
-
-# or a dash:
-
-head(paste(myDF$Month, myDF$DayofMonth, sep="-"))
-
-# Now we can tabulate the number of flights per day,
-# using the table command:
-
-table(paste(myDF$Month, myDF$DayofMonth, sep="/"))
-
-# 2b. To find the most popular five days,
-# we can sort the table, and then consider the tail,
-# using the n=5 option,
-# since we only want the 5 most popular dates.
-
-tail(sort(table(paste(myDF$Month, myDF$DayofMonth, sep="/"))),n=5)
-
-# 2c. We just change tail to head,
-# to find the 5 least popular dates:
-
-head(sort(table(paste(myDF$Month, myDF$DayofMonth, sep="/"))),n=5)
-
-----
-
-
-== Project 16
-
-Project 16 needs to be saved as a `.ipynb` file. This is different from the previous two assignments, where the file was uploaded directly from each student's GitHub page. Students need to download it from this link. Thanks!
-
-https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw16.ipynb
-
-Question 1
-
-Consider the flights from 2005 in the Data Expo 2009 data set. The actual departure times, as you know, are given in the DepTime column. In this question, we want to categorize the departure times according to the hour of departure. For instance, any time in the 4 o'clock hour (in the very early morning) should be classified together. These are the times between 0400 and 0459 (because the times are given in military time). One way to do this is to divide each of the times by 100, and then to take the "floor" of the results, and then make a "table" of the results. For practice (just to understand things), give this a try with the head of the DepTime column, one step at a time, to make sure that you understand what is happening. Then: a. Classify all of the 2005 departure times, according to the hour of departure, using this method. b. During which hour of the day did the most flights depart?
-
-Question 2
-
-a. Here is another way to solve the question above. Read the documentation for the "cut" command. For the "breaks" parameter, use:
-seq(0, 2900, by=100)
-and be sure to set the parameter "right" to be FALSE.
-
-b. Check that you get the same result as in question 1, using this method.
-
-c. Why did we choose to use 2900 instead of (say) 2400 in this method?
-
-== Project 16 Solutions
-
-
-[source,R]
-----
-# 1a. We read the data from the 2005 flights into a data frame
-
-myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")
-
-# Then we divide each time by 100 and take the floor:
-
-table(floor(myDF$DepTime/100))
-
-# and we get:
-
-# 0 1 2 3 4 5 6 7 8 9 10
-# 21747 7092 2027 458 1610 114469 430723 440532 469386 447705 432526
-# 11 12 13 14 15 16 17 18 19 20 21
-# 446432 443252 440903 416661 441021 424299 457678 431613 390398 321680 235810
-# 22 23 24 25 26 27 28
-# 128382 58386 1711 301 56 7 1
-
-# 1b. The most flights departed during 8 AM to 9 AM:
-
-sort(table(floor(myDF$DepTime/100)))
-
-# 28 27 26 25 3 4 24 2 1 0 23
-# 1 7 56 301 458 1610 1711 2027 7092 21747 58386
-# 5 22 21 20 19 14 16 6 18 10 7
-# 114469 128382 235810 321680 390398 416661 424299 430723 431613 432526 440532
-# 13 15 12 11 9 17 8
-# 440903 441021 443252 446432 447705 457678 469386
-
-# 2a. We cut the DepTime column, using the breaks of 0000 through 2900
-
-table(cut(myDF$DepTime, breaks=seq(0000,2900,by=100), right=FALSE))
-
-# and we get:
-
-# [0,100) [100,200) [200,300) [300,400)
-# 21747 7092 2027 458
-# [400,500) [500,600) [600,700) [700,800)
-# 1610 114469 430723 440532
-# [800,900) [900,1e+03) [1e+03,1.1e+03) [1.1e+03,1.2e+03)
-# 469386 447705 432526 446432
-# [1.2e+03,1.3e+03) [1.3e+03,1.4e+03) [1.4e+03,1.5e+03) [1.5e+03,1.6e+03)
-# 443252 440903 416661 441021
-# [1.6e+03,1.7e+03) [1.7e+03,1.8e+03) [1.8e+03,1.9e+03) [1.9e+03,2e+03)
-# 424299 457678 431613 390398
-# [2e+03,2.1e+03) [2.1e+03,2.2e+03) [2.2e+03,2.3e+03) [2.3e+03,2.4e+03)
-# 321680 235810 128382 58386
-# [2.4e+03,2.5e+03) [2.5e+03,2.6e+03) [2.6e+03,2.7e+03) [2.7e+03,2.8e+03)
-# 1711 301 56 7
-# [2.8e+03,2.9e+03)
-# 1
-
-# or if you want to re-format the output, you can write, for instance:
-
-table(cut(myDF$DepTime, breaks=seq(0000,2900,by=100), dig.lab=4, right=FALSE))
-
-# [0,100) [100,200) [200,300) [300,400) [400,500) [500,600)
-# 21747 7092 2027 458 1610 114469
-# [600,700) [700,800) [800,900) [900,1000) [1000,1100) [1100,1200)
-# 430723 440532 469386 447705 432526 446432
-# [1200,1300) [1300,1400) [1400,1500) [1500,1600) [1600,1700) [1700,1800)
-# 443252 440903 416661 441021 424299 457678
-# [1800,1900) [1900,2000) [2000,2100) [2100,2200) [2200,2300) [2300,2400)
-# 431613 390398 321680 235810 128382 58386
-# [2400,2500) [2500,2600) [2600,2700) [2700,2800) [2800,2900)
-# 1711 301 56 7 1
-
-# We just sort the output of the command above, and we see that
-# the most flights departed during 8 AM to 9 AM
-
-sort(table(cut(myDF$DepTime, breaks=seq(0000,2900,by=100), dig.lab=4, right=FALSE)))
-
-# [2800,2900) [2700,2800) [2600,2700) [2500,2600) [300,400) [400,500)
-# 1 7 56 301 458 1610
-# [2400,2500) [200,300) [100,200) [0,100) [2300,2400) [500,600)
-# 1711 2027 7092 21747 58386 114469
-# [2200,2300) [2100,2200) [2000,2100) [1900,2000) [1400,1500) [1600,1700)
-# 128382 235810 321680 390398 416661 424299
-# [600,700) [1800,1900) [1000,1100) [700,800) [1300,1400) [1500,1600)
-# 430723 431613 432526 440532 440903 441021
-# [1200,1300) [1100,1200) [900,1000) [1700,1800) [800,900)
-# 443252 446432 447705 457678 469386
-
-# 2b. We do get the same results as in question 1.
-
-# 2c. We choose to use 2900 instead of (say) 2400 in this method
-# because some flights departed after midnight.
-# The time stamps are between 0000 and 2400
-# (this is like military time, between 00:00 and 24:00).
-# Some flights have delays until after midnight,
-# and they are recorded in a surprising way,
-# e.g., 24:30 for 30 minutes past midnight,
-# or 26:10 for 2 hours and 10 minutes past midnight.
-# In our data set, it happens that all of the times
-# are between 0000 and 2900. I just checked the max to find that out.
-# So that's why we use 2900 as an upper boundary, instead of 2400.
-
-max(myDF$DepTime, na.rm=T)
-
-----
-
-
-== Project 17
-
-Please download this template and use it to submit your solutions to GitHub:
-
-https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw17.ipynb
-
-Recall the 2018 election data, available here: `/depot/statclass/data/election2018/itcont.txt`
-
-and the data dictionary for this data, which is available here:
-
-https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description
-
-Question 1
-
-a. Use the system command in R to read the data for the first 100,000 donations and store this data into a file called: shortfile.txt (We use .txt instead of .csv because the file is not comma delimited.)
-
-b. Use the read.csv command to read this data into a data frame in R, called: myDF (Hint: check the help for read.csv: ?read.csv to remind yourself about the "sep" and the "header" parameters for read.csv. In particular, this data has "|" as the separator between the data elements, and it does not have a header.)
-
-c. Check the dimension of the resulting data frame. It should be 100,000 rows and 21 columns.
-
-Question 2
-
-a. Split the data for these 100,000 donations according to the State from which the donation was given. Store the resulting data in a list called: myresult (Hint: Check the data dictionary for the meanings of the columns, since we do not have column headers.) (Another hint: Remember that we can refer to a column of data in a data frame by its number, for instance, myDF[[8]] is the name of the donor.)
-
-b. Check the names of myresult: names(myresult) We see that the first element of the list does not have a name. This is a pain! To solve this, you can give it a name, for instance, by writing: names(myresult)[1] <- "unknown" (or any other kind of name that you want, to indicate that the name is unknown)
-
-Question 3
-
-a. Find the mean donation amount, according to each state.
-
-b. What is the mean donation from Hoosiers (i.e., for people from Indiana)?
-
-c. Find the standard deviation of the donation amount, according to each state.
-
-d. Find the number of donations, according to each state.
-
-e. For a sanity check, make sure that the number of donations in 3d adds up to 100,000 altogether.
-
-Example
-
-[source,R]
-----
-# Remember that we can make system calls from R.
-# For instance, we can take the first 50000 lines of a file
-# and store them into a new file called shortfile.csv
-# To do this, we use the "system" command in R.
-# it basically enables us to run terminal commands
-# while we are still working in R.
-
-# This is an especially handy technique,
-# because the operating system itself is much faster than R.
-
-system("head -n50000 /depot/statclass/data/dataexpo2009/2005.csv >shortfile.csv")
-
-# Now we can read this (much shorter!) file into R.
-
-myDF <- read.csv("shortfile.csv")
-
-# It has data about only 49,999 flights because the header itself
-# counts as one of the 50,000 lines that we extracted.
-
-dim(myDF)
-
-# We can check to make sure that the read.csv worked,
-# by examining the head of myDF:
-
-head(myDF)
-
-# Within myDF, we can break the data into pieces,
-# according to (say) the Origin airport.
-# The split command can easily do that for us.
-# We give the split command 2 pieces of data:
-# 1. The data that should be split, and
-# 2. The way that the data is classified into pieces.
-# So, for instance, we can split the DepDelays
-# into pieces, based on the Origin.
-
-myresult <- split(myDF$DepDelay, myDF$Origin)
-
-# If we check the length of the result, it is 93:
-
-length(myresult)
-
-# because there are DepDelays from 93 airports.
-
-# The type of data is a "list".
-
-class(myresult)
-
-# We have not (yet) worked with lists,
-# but they are a lot like data frames.
-# The difference is that each element of a list can have a different length.
-
-# For example, here are the first six elements
-# of the list:
-
-head(myresult)
-
-# The DepDelays for flights departing from Albuquerque are found in the second element:
-
-myresult$ABQ
-
-# or we can get this data by just asking directly for the second element,
-# without knowing the name of the element:
-
-myresult[[2]]
-
-# Now we can use the power of the apply functions that R provides.
-# You are already familiar with the tapply function.
-# Another very commonly used apply function is called "sapply".
-
-# We use sapply to apply a function to each part of a collection of data.
-
-# For example, remember that myresult has 93 parts:
-
-length(myresult)
-
-# We can take the mean of the data in each element of myresult
-# by applying the function "mean" to each element, as follows:
-
-sapply(myresult, mean)
-
-# Unfortunately, many of the results are NA's, so we can use na.rm=T
-
-sapply(myresult, mean, na.rm=T)
-
-# We can apply many functions to myresult in this way.
-
-# For instance, here is the variance of each part of the data in myresult:
-
-sapply(myresult, var, na.rm=T)
-
-# or the standard deviation:
-
-sapply(myresult, sd, na.rm=T)
-
-# Here is the number of flights from each Origin airport:
-
-sapply(myresult, length)
-
-# If we add up the number of flights, we better get 49,999:
-
-sum(sapply(myresult, length))
-
-# It is worthwhile to experiment with sapply.
-# For instance, for something fun to try,
-# you can (simultaneously) make a plot of the DepDelays
-# from each of the 93 airports, as follows:
-
-sapply(myresult, plot)
-
-# This runs the "plot" function on each piece of data,
-# in other words, on the data from each Origin airport.
-
-# You can see the first 6 DepDelays from each Origin airport, as follows:
-
-sapply(myresult, head)
-
-# This is taking the "head" of each part of the data.
-----
-
-== Project 17 Solutions
-
-[source,R]
-----
-# 1a. We first store the first 100,000 donations into a file
-# called shortfile.txt
-# using the system command
-
-system("head -n100000 /depot/statclass/data/election2018/itcont.txt >~/shortfile.txt")
-
-# 1b. Now we import this data into a data frame, using read.csv
-
-myDF <- read.csv("~/shortfile.txt", header=F, sep="|")
-
-# 1c. The resulting data frame has 100000 rows and 21 columns, as it should!
-
-dim(myDF)
-
-# 2a. Now we split the data for the donations according to the State
-# from which the donation was given
-
-myresult <- split(myDF$V15, myDF$V10)
-
-# 2b. We check the names of myresult:
-
-names(myresult)
-
-# and the first element of the list does not have a name.
-# so we give it a name, for instance, by writing:
-
-names(myresult)[1] <- "unknown"
-
-# 3a. The mean donation amount, from each state,
-# can be found using the sapply command:
-
-sapply(myresult, mean, na.rm=T)
-
-# 3b. The mean donation from Indiana can be found
-# by extracting the entry with the name "IN"
-
-sapply(myresult, mean, na.rm=T)["IN"]
-
-# and we get: IN: 367.914678899083
-
-# 3c. The standard deviation of the donation amount for each state is:
-
-sapply(myresult, sd, na.rm=T)
-
-# 3d. The number of donations per state can be found by
-# checking the length of the vector of donations from each state:
-
-sapply(myresult, length)
-
-# 3e. For our sanity check, we see that yes, indeed,
-# the total number of donations is 100,000:
-
-sum(sapply(myresult, length))
-
-----
-
-
-== Project 18
-
-Here is the Project 18 template:
-
-https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw18.ipynb
-
-Consider the election data stored at: `/depot/statclass/data/election2018/itcont.txt`
-
-The data set is very large. You might choose to analyze a smaller portion of the data initially, and then to run your code on the full data set, once you have the code working correctly.
-
-Sometimes there will be warnings in Jupyter Notebooks, and you need to scroll past the warnings, to see the results of your analysis. This is a known issue with Jupyter Notebooks, and other people are experiencing it too:
-
-https://github.com/IRkernel/IRkernel/issues/590
-
-Recall that the data dictionary for the data is found here:
-
-https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description
-
-Question 1
-
-a. The first column contains the "Filer identification number" for various committees. Which of these committees received the largest monetary amount of donations?
-
-b. Use the tapply function to make a matrix whose rows correspond to states, whose columns correspond to the "filer identification numbers" of committees, and whose entries contain the total amount of the donations given to the committees by donors from each individual state. (Hint: Wrap the states and the filer identification numbers into a list.) Print the block of the first 10 rows and 10 columns, so that the TA's can see the results of your work.
-
-Question 2
-
-For this question, be sure to take into account the city and state (together).
-
-a. Identify the six cities that made the largest number of donations.
-
-b. Identify the six cities with the largest total monetary amount of donations.
-
-Question 3
-
-a. Split the data (using the split command) about the donations, according to the day when the transaction was made. Once this split is accomplished, use the sapply function to find the following:
-
-b. On which day was the total monetary amount of donations the largest?
-
-c. On which day was the largest number of donations made?
-
-
-[source,R]
-----
-# Examples that might help with Project 18 (but are using the airline data set)
-# We can read in the 2005 flight data:
-myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")
-# and verify that we got it read in properly, using the head:
-head(myDF)
-# We can find the mean DepDelays, according to the Origin and Destination (simultaneously).
-# This puts the Origins on the rows and the Destinations on the columns.
-tapply(myDF$DepDelay, list(myDF$Origin, myDF$Dest), mean, na.rm=T)
-# If you just want to see the first 10 rows and columns,
-# you can save the results to a variable:
-myresult <- tapply(myDF$DepDelay, list(myDF$Origin, myDF$Dest), mean, na.rm=T)
-# and then load the rows and columns that you want to see:
-myresult[1:10,1:10]
-# Many are NA because you can't always get from one city to another.
-# You can look up specific Origins and Destinations as follows:
-myresult[c("DEN","ORD","JFK"),c("BOS","IAD","ATL")]
-# Those are flights from Origin "DEN" or "ORD" or "JFK" to Destinations "BOS" or "IAD" or "ATL"
-# Here is another example:
-# We can split all of the data about the DepDelays, according to the date.
-# To do this, I first need to make a column that contains the dates,
-# since the airport data doesn't have such a column (yet):
-myDF$completedates <- paste(myDF$Month, myDF$DayofMonth, myDF$Year, sep="/")
-# Then we split the DepDelays, according to the dates:
-mydelays <- split(myDF$DepDelay, myDF$completedates)
-# This gives us a list:
-class(mydelays)
-# Of course the length is 365, because there are 365 days per year:
-length(mydelays)
-# Here are the delays from Christmas Day:
-mydelays["12/25/2005"]
-# Now we can easily use the sapply function on
-# the DepDelay data, which has already been grouped according to the days.
-# Here is the mean DepDelay on each day:
-sapply(mydelays, mean, na.rm=T)
-# Here is the standard deviation of the DepDelay, on each day:
-sapply(mydelays, sd, na.rm=T)
-# Here is the length of each piece of the data,
-# i.e., the number of pieces of data per day.
-# (This is obviously equal to the number of flights per day too,
-# because each flight has *some kind* of delay!)
-sapply(mydelays, length)
-----
-
-== Project 18 Solutions
-
-[source,R]
-----
-# First, we read in the 2018 election data (the examples above used the airline data):
-myDF <- read.csv("/depot/statclass/data/election2018/itcont.txt", header=F, sep="|")
-
-# 1a. The committee C00401224 received $565007473 in donations altogether.
-tail(sort(tapply(myDF$V15, myDF$V1, sum, na.rm=T)))
-# Here are the top six committees, according to the total monetary donations:
-# C00000935 109336606
-# C00571703 114336858
-# C00003418 116712977
-# C00484642 130390881
-# C00504530 133582635
-# C00401224 565007473
-
-# 1b. We first build a matrix with the data from the states (column 10) on the rows
-# and the data from the committees (column 1) on the columns.
-# Each entry contains the total amount of the donations from that state to that committee.
-myresult <- tapply(myDF$V15, list(myDF$V10, myDF$V1), sum, na.rm=T)
-# Now we display the results of the first 10 rows and 10 columns:
-myresult[1:10,1:10]
-# C00000059 C00000422 C00000638 C00000729 C00000885 C00000901 C00000935 C00000984 C00001016 C00001180
-# NA NA NA NA NA NA 174182 NA NA NA
-# AA NA NA NA NA NA NA 15336 NA NA NA
-# AE NA NA NA NA NA NA 13122 NA NA NA
-# AK NA 5148 NA 4384 1985 23135 175850 NA 8674 NA
-# AL NA 7152 NA 9722 1868 103106 407595 NA 13518 NA
-# AP NA NA NA NA NA NA 4705 NA NA NA
-# AR 420 9994 NA 5750 1406 13910 183457 5000 12730 NA
-# AS NA NA NA NA NA NA NA NA NA NA
-# AZ NA 10074 NA 26778 615 17040 1223310 NA 12488 NA
-# CA NA 89498 NA 41752 31705 108253 28039517 5000 256676 NA
-
-# 2a. We paste together the city and state data using the paste function.
-# Then we tabulate the number of such donations, according to these city-state pairs.
-# Finally, we sort these counts and print the six largest ones, using the tail function.
-tail(sort(table(paste(myDF$V9,myDF$V10))))
-
-# 2b. We paste together the city and state data using the paste function.
-# Then we add the monetary amount of the donations (from column 15),
-# according to these city-state pairs.
-# Finally, we sort these total monetary amounts and
-# print the six largest ones, using the tail function.
-tail(sort(tapply(myDF$V15,paste(myDF$V9,myDF$V10),sum,na.rm=T)))
-
-# 3a. We split the data about the donation amounts (from column 15),
-# according to the day on which the donations were made.
-myresult <- split(myDF$V15, myDF$V14)
-
-# 3b. Now we sum the monetary amount of the donations, for each day:
-tail(sort(sapply(myresult, sum, na.rm=T)))
-
-# 3c. Alternatively, we see how many donations were made on each day,
-# by finding the length of the vector that has the donations for that day,
-# i.e., by finding how many donations there were for each day.
-tail(sort(sapply(myresult, length)))
-
-----
-
-== Project 19
-
-There was no Project 19
-
-
-== Project 20
-
-Please submit your answers, when you are finished, using GitHub. We put an RMarkdown file into your individual GitHub accounts, for this purpose.
-
-Notes about scraping data:
-
-As a gentle reminder about how to access RStudio:
-
-Log on to Scholar:
-
-https://desktop.scholar.rcac.purdue.edu
-
-(or use the ThinLinc client on your computer if you installed it!)
-
-open the terminal on Scholar and type:
-
-[source,bash]
-----
-module load gcc/5.2.0
-module load rstudio
-rstudio &
-----
-
-Please remember to install and load the XML and the RCurl libraries.
-
-Using RStudio, we start to learn how to extract data from the web.
-
-Use the data from the Billboard Hot 100 for question 1.
-
-Please use the data from the week you were born. For instance, if I solve question 1, I would use the data located here:
-
-https://www.billboard.com/charts/hot-100/1976-10-13
-
-Question 1
-
-On the Hot 100 chart, from the day of your birth:
-
-a. Extract the titles of the songs ranked #2 through #100.
-
-b. Extract the artists for those 99 songs.
-
-c. Extract the title of the number 1 song for that day.
-
-d. Extract the artist for the number 1 song for that day.
-
-Question 2
-
-a. Extract the city where the National Park property for Catoctin Mountain is located. This data is found at: `https://www.nps.gov/cato/index.htm` or in the file: `/depot/statclass/data/parks/cato.htm`
-
-b. Extract the state where Catoctin Mountain is located.
-
-c. Extract the zip code where Catoctin Mountain is located.
-
-Question 3
-
-a. Identify three potential websites that you are interested in trying to scrape yourself, during the upcoming seminars. Look for websites with data that is (relatively) easy to scrape, for instance: systematic URL's that are easy to understand; (relative) consistency in how the data is stored; and data that is embedded in the page, rather than in csv files that are already prepared for download. (We want to actually scrape some data.)
-
-b. For each of the three websites that you identified, give a very brief description of the kind of data that you want to scrape.
-
-== Project 20 Billboard Example
-
-[source,R]
-----
-install.packages("XML")
-library(XML)
-
-# Considering the songs and artists that were popular at the time of my birthday in 1976, we can scrape some data from the Billboard Hot 100 chart
-
-# Here are the song titles #2 through #100 from my birthday
-
-# Please notice the double underscore before title:
-
-xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/1976-10-13")),
-"//*/div[@class='chart-list-item__title']", xmlValue)
-
-# Here are the artists of the songs #2 through #100 from my birthday
-
-# Please notice the double underscore before artist:
-
-xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/1976-10-13")),
-"//*/div[@class='chart-list-item__artist']", xmlValue)
-----
-
-== Project 20 National Park Service Example
-
-[source,R]
-----
-
-# We will use the XML package to parse html (or XML) data
-install.packages("XML")
-library(XML)
-
-# and the RCurl package if you want to pull the data directly from the web:
-install.packages("RCurl")
-library(RCurl)
-
-# To see the list of the parks, we can go here:
-# https://www.nps.gov/findapark/index.htm
-# if you use Control-U
-# (i.e., the Control Key and the letter U Key at once)
-# then you can see the code
-# for the way that the webpage is created.
-
-# You can use Firefox to open any of the files
-# with the data from the state parks;
-# they are all found inside this directory:
-# /depot/statclass/data/parks/
-
-########################################
-# To study a specific park,
-# we look at the source for the Abraham Lincoln Birthplace:
-# https://www.nps.gov/abli/index.htm
-# We load that webpage in a browser and then type Control-U
-
-# You search the code in a page with Control-F in Firefox
-
-# Here is the name of the Abraham Lincoln Birthplace:
-xpathSApply(htmlParse(getURL("https://www.nps.gov/abli/index.htm")), "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-
-# Here is the street address:
-xpathSApply(htmlParse(getURL("https://www.nps.gov/abli/index.htm")), "//*/span[@itemprop='streetAddress']", xmlValue)
-
-# Alternatively, we can also do this with the file itself,
-# instead of pulling the data from the web:
-
-xpathSApply(htmlParse("/depot/statclass/data/parks/abli.htm"), "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-
-xpathSApply(htmlParse("/depot/statclass/data/parks/abli.htm"), "//*/span[@itemprop='streetAddress']", xmlValue)
-----
-
-== Project 20 Answers
-
-[source,R]
-----
-
-install.packages("XML")
-library(XML)
-
-# 1a. here are the song titles, 2 through 100, from (for instance) January 20, 2000:
-# but students should use their OWN BIRTHDAYS for this question.
-xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")),
-"//*/div[@class='chart-list-item__title']", xmlValue)
-
-# 1b. here are the artists of the songs 2 through 100:
-xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")),
- "//*/div[@class='chart-list-item__artist']", xmlValue)
-
-# 1c. here is the title of the number 1 song from that week:
-xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")),
- "//*/div[@class='chart-number-one__title']", xmlValue)
-
-# 1d. here is the artist for the number 1 song from that week:
-xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")),
- "//*/div[@class='chart-number-one__artist']", xmlValue)
-
-# 2a. Here is the city:
-xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")),
- "//*/span[@itemprop='addressLocality']", xmlValue)
-# alternatively:
-xpathSApply(htmlParse("/depot/statclass/data/parks/cato.htm"),
- "//*/span[@itemprop='addressLocality']", xmlValue)
-
-# 2b. Here is the state:
-xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")),
- "//*/span[@itemprop='addressRegion']", xmlValue)
-# alternatively:
-xpathSApply(htmlParse("/depot/statclass/data/parks/cato.htm"),
- "//*/span[@itemprop='addressRegion']", xmlValue)
-
-# 2c. Here is the zip:
-xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")),
- "//*/span[@itemprop='postalCode']", xmlValue)
-# alternatively:
-xpathSApply(htmlParse("/depot/statclass/data/parks/cato.htm"),
- "//*/span[@itemprop='postalCode']", xmlValue)
-
-# 3a, 3b answers will vary
-
-----
-
-
-== Project 21
-
-
-Please use this template to submit Project 21:
-
-https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw21.Rmd
-
-This project is supposed to be an easy modification of the project example,
-since it is almost time for Spring Break!
-
-1. Modify the NPS example to extract the city location for every National Park.
-
-2. Same question, for the state location for every National Park.
-
-3. Same question, for the zip code for every National Park.
-
-Note: Do not worry if some of the results have extra spaces. We can deal with that later!
-
-
-
-== Project 21 Example:
-
-[source,R]
-----
-
-library(RCurl)
-library(XML)
-
-# The webpage for the National Park Service includes
-# only a little information about every NPS property:
-# https://www.nps.gov/findapark/index.htm
-# Importantly, it has the 4-letter codes for each property.
-
-# If you type Control-U, then you can see the source for the page.
-# Scroll down, and you will see on
-# lines 210 through 753 these 4-letter codes
-# (It might not be exactly lines 210 through 753 because the NPS
-# modifies its webpages, just like any organization does!)
-
-# Each such NPS property has the 4-letter code as
-# an attribute to one of the XML tags. They are all found inside
-# of a "select" tag,
-# and then inside an "optgroup" tag,
-# and then inside an "option" tag.
-# You extract this XML value using the xmlGetAttr, like this:
-
-myparks <- xpathSApply(htmlParse(getURL("https://www.nps.gov/findapark/index.htm")), "//*/div/select/optgroup/option", xmlGetAttr, "value")
-
-# and then we see the full listing of all 497 of these 4-letter codes here:
-
-myparks
-
-# Last week, we already learned how to extract the street address of a park.
-# For instance, this is the name of the Abraham Lincoln Birthplace:
-
-xpathSApply(htmlParse(getURL("https://www.nps.gov/abli/index.htm")),
- "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-
-# Similarly, this is the name of Catoctin Mountain:
-
-xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")),
- "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-
-# Here's the name of the Great Smoky Mountains;
-# we just change "abli" or "cato" to "grsm" and we have it!
-
-xpathSApply(htmlParse(getURL("https://www.nps.gov/grsm/index.htm")),
- "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-
-# In general, we could paste in the 4-letter code of the park, like this:
-
-x <- "abli"
-xpathSApply(htmlParse(getURL( paste0("https://www.nps.gov/", x, "/index.htm"))),
- "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)
-
-# where the value of "x" is the park's 4-digit abbreviation.
-# Let's try to get these two park names simultaneously now.
-
-# We build a function to do so:
-
-mynameextractor <- function(x) {xpathSApply(htmlParse(getURL( paste0("https://www.nps.gov/", x, "/index.htm"))),
- "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)}
-
-# and then we apply it to each of these 4-letter codes:
-
-sapply( c("abli", "cato", "grsm"), mynameextractor )
-
-# One thing about scraping data from the web is that
-# there are always "hiccups" in the process,
-# i.e., there are always challenges.
-# For instance, we have codes for "cbpo" and "foca"
-# but those pages do not actually exist (yet).
-# So we need to remove them from our list of 4-letter codes:
-
-mygoodparks <- myparks[(myparks != "cbpo")&(myparks != "foca")]
-
-# Now we are ready to apply our function to
-# all the NPS properties. We do it first to the "head",
-# just to make sure things are working:
-
-myresults <- sapply( head(mygoodparks), mynameextractor )
-
-myresults
-
-# and if this worked, then we apply it to the full list of parks.
-# P.S. Depending on your web connection, and how many
-# students do this at one time, you might need to run
-# this a few times. It did not work quite right for me
-# on the first try, but that is the nature of websites,
-# i.e., sometimes there are failures and/or service interruptions,
-# but it should generally work in just a few minutes!
-
-myresults <- sapply( mygoodparks, mynameextractor )
-
-# Finally, here are the names of all the park properties:
-
-myresults
-
-
-----
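-
-For Project 21 itself, here is a minimal sketch (not an official solution): we only need to change the XPath inside the extractor function to the itemprop-based paths used in the Project 20 examples.
-
-[source,R]
-----
-# a sketch, assuming XML and RCurl are loaded and mygoodparks was built as above
-mycityextractor <- function(x) {
-  xpathSApply(htmlParse(getURL(paste0("https://www.nps.gov/", x, "/index.htm"))),
-    "//*/span[@itemprop='addressLocality']", xmlValue)
-}
-
-mycities <- sapply(mygoodparks, mycityextractor)
-
-# for the states, use itemprop='addressRegion' instead;
-# for the zip codes, use itemprop='postalCode' instead
-----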
-
-== Project 22
-
-Here is an *optional* Project 22. You don't need to do it, but if you choose to do it, we will count it as a replacement for your lowest previous project grade.
-
-In this folder on Scholar: `/depot/statclass/data/examples` there is a program called "challenge", so you can run it by typing in the terminal something like this: `/depot/statclass/data/examples/challenge 111`
-
-Here is the goal: You can try to make a program (in any language) that converts strings of digits to strings of letters, by substituting
-
-[source,bash]
-----
-1 -> a
-2 -> b
-3 -> c
-......
-26 -> z
-----
-
-Please notice that we do *not* say
-
-[source,bash]
-----
-01 -> a
-----
-
-but rather, we say
-
-[source,bash]
-----
-1 -> a
-----
-
-The program should print the number of ways to do this.
-
-So, for instance, if you type:
-
-`/depot/statclass/data/examples/challenge 111`
-
-It will return the number 3 because there are exactly 3 ways to decode the string 111, namely:
-
-[source,bash]
-----
-ak
-ka
-aaa
-----
-
-Makes sense? Here is another example:
-
-`/depot/statclass/data/examples/challenge 15114`
-
-will return the number 6 because there are exactly 6 ways to decode the string 15114, namely:
-
-[source,bash]
-----
-aeaad
-aean
-aekd
-oaad
-oan
-okd
-----
-
-The challenge (again, only for bonus credit) is to write a program that will produce the same results as the program that I gave you. You are welcome to use any programming language.
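-
-If you want a starting point, here is one possible recursive sketch in R. This is not the provided "challenge" program, just one way to reproduce its output on the examples above.
-
-[source,R]
-----
-# count the ways to decode a string of digits into letters a..z
-countways <- function(s) {
-  if (nchar(s) == 0) return(1)                  # empty string: exactly one way
-  if (substr(s, 1, 1) == "0") return(0)         # no letter corresponds to a leading 0
-  total <- countways(substr(s, 2, nchar(s)))    # use one digit (1 through 9)
-  if (nchar(s) >= 2 && as.integer(substr(s, 1, 2)) <= 26) {
-    total <- total + countways(substr(s, 3, nchar(s)))  # use two digits (10 through 26)
-  }
-  total
-}
-
-countways("111")    # 3, matching the example above
-countways("15114")  # 6, matching the example above
-----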
-
-== Project 23
-
-Here is the project. We will build on this project in the upcoming work that we will do in April.
-
-Recall that in Project 20, question 3ab, you identified some websites that you were interested to scrape. Pick only 1 of the websites that is of interest to you, and scrape at least 5 pieces of information from a few pages within that website. (I am being a little nebulous here, because I want you to have the freedom to explore!) For instance, you could pick IMDB as the website and scrape 3 pieces of information from 5 different movies. BUT you can pick any website. It does *NOT* need to be the IMDB website. You can do any website you like. That's the entire assignment for this week! If you are not able to do it for the sites that you mentioned in Project 20, then you can (instead) identify a different website to scrape.
-
-[source,R]
-----
-# We recall that we can scrape information (which is stored in XML format) from the internet, using XPath.
-# Remember that we load Scholar in the web interface and open a browser and use Control-U to see the XML code.
-# Inside R, we first load the XML library and the RCurl library:
-
-library(XML)
-library(RCurl)
-
-# Then we just download the webpage and we put the path to the desired web content into the notation of XPath.
-
-# We already gave some examples of how to scrape XML data from the web,
-# back in Project 20 and Project 21. Please feel welcome to read those again and remind yourself.
-
-# Here are a few more examples to inspire you, about how to scrape and parse some XML code from the internet.
-
-#####################################################
-# Example: IMDB (Internet Movie Database)
-# We can scrape information about movies. For instance, IMDB is a popular movie website.
-# The information about the movie Say Anything is given here: https://www.imdb.com/title/tt0098258/
-# This is Dr Ward's favorite movie, by the way!
-# Here is the title and year, which are stored together in the same place.
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/div[@class='title_wrapper']/h1", xmlValue)
-
-
-# In this XML, if you only want the year 1989 in which the movie was made,
-# but do not care about the title, then just go deeper, by
-# also including the "span" and "a" tags too:
-
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/div[@class='title_wrapper']/h1/span/a", xmlValue)
-
-# Here is a completely different place in the XML to find the title:
-
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/div/div[@id='ratingWidget']/p/strong", xmlValue)
-
-
-
-# Here is the specific release date:
-
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/a[@title='See more release dates']", xmlValue)
-
-
-
-# We can try to extract the Director and the Writer.
-# Cameron Crowe was both the Director and the Writer.
-# If we do the following search, we get 3 results:
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/div[@class='credit_summary_item']", xmlValue)
-# So we could save this information in a vector, and just extract the first and second elements
-v <- xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/div[@class='credit_summary_item']", xmlValue)
-# Now we have the Director:
-v[1]
-# and the Writer:
-v[2]
-# in separate elements.
-# There are other ways to do this, once we get more comfortable with XML
-# but this is a good start!
-
-
-
-# The title is stored lots and lots of places in the webpage.
-# It is also sometimes stored in an XML tag itself, rather than in the content of the page.
-# For instance, search for the phrase:
-# og:title
-# in the code in your browser to see this.
-
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/meta[@property='og:title']", xmlGetAttr, "content")
-
-
-# Here's another way, which is just 2 lines later, in the source code for the page.
-# We just change "property" to "name"
-# and we change "og:title" to "title" and we get the title and year again:
-
-xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")),
- "//*/meta[@name='title']", xmlGetAttr, "content")
-
-# These are just meant to be illustrative examples to try to help! Have fun! Explore!
-
-
-----
-
-
-
-== Project 24
-
-We build on Project 23 as follows:
-
-[source,R]
-----
-# In Project 23, we scraped a few elements of data from a website.
-##################################################
-# Wrap your code from Project 23 into a function, and then
-# scrape at least 100,000 pieces of data from any website: your choice!
-##################################################
-# Here is an example of how to get started:
-# First we load the needed libraries:
-library(XML)
-library(RCurl)
-# Then we wrap our code into a function.
-mytitlefunc <- function(x) {
- xpathSApply(htmlParse(getURL(paste0("https://www.imdb.com/title/tt", x, "/"))),
- "//*/div[@class='title_wrapper']/h1", xmlValue)
-}
-
-# Notice that we replaced the website:
-# https://www.imdb.com/title/tt0098258/
-# with (instead) some code to build the website as we go:
-# paste0("https://www.imdb.com/title/tt", x, "/"
-# This uses the value x as the number of the movie.
-# Now we can run our function and extract the results for a movie:
-mytitlefunc("0098258")
-# Our function is vectorized, i.e., we can run it on a vector,
-# and it will return the results for each individual movie,
-# for instance:
-mytitlefunc(c("0110000", "0110001", "0110002", "0110003"))
-# We could try to run it on a sequence of numbers, but
-# this will not quite work at first.
-# For instance, if we try to run it on this sequence:
-110000:110003
-# We see that these numbers are only 6 digits, but the URL expects to
-# have a total of 7 digits to work.
-# So we can use the string print function,
-# which is also available in other languages too:
-sprintf("%07d", 110000:110003)
-# Here we have the "%" which means we are printing a variable,
-# and the "0" means we should pad things with leading zeroes if needed,
-# and the "7" means that we want 7 digits, and the "d" means digits.
-# Now it will work on this input
-mytitlefunc(sprintf("%07d", 110000:110003))
-# and we can even change this to (say) 100 pages at a time:
-mytitlefunc(sprintf("%07d", 110000:110100))
-----
-
-== Hint from Luke Francisco:
-
-Based on my experience with students during my office hours last week, I thought I would share some things to consider when you are looking for data to use on project 24 if you have not done so already. I should have posted this earlier, but it just now dawned on me that this would make a good Piazza post.
-
-When you are scraping large amounts of data from the web, you want to focus on replicability. If you look at Dr. Ward's previous two examples with the national parks data and the Billboard music data, you will see what I am talking about. There are several websites in each case (one webpage for each national park and one webpage for the Billboard top songs for each week). If you want to scrape all of this data you will need to give R all of the URL's in order to go find the data. What makes these two examples easy is that the webpages all have the exact same URL's except for one part. For example, the Billboard top songs all have the same URL except for a different date inserted. This allows you to make a vector of dates and insert each date into the URL. If this were not the case and the URL's were totally different for each week, you would have to find the URL for every week since 1980, which would be extremely laborious!!!!!
-
-You also want to make sure that each website is formatted similarly. Consider the addresses of the national parks - they were all found at the bottom of the page for the corresponding national park with the same HTML formatting. In the case of the Billboard data, the website for every week has the top songs entered in the exact same format - the only things that change are the song titles and artists, which are the data we are interested in. This means your code that pulls the songs on the Billboard charts from this week will also pull the songs on the Billboard charts from 1980!!!!!!
-
-Consider this example I used in my office hours last week. Ken Pomeroy publishes statistics for all 353 men's division 1 college basketball teams. The data for the most recent season can be found at this link:
-
-https://kenpom.com/index.php?y=2019
-
-If you look at the HTML code of the website and CTRL+F search for Purdue, you will notice that the data for each team, which is the data in each row of the table on the website, is entered with the exact same format. Better yet, change the last four digits of the URL to 2018 and CTRL+F for Purdue again in the HTML code. The data for the 2018 season for every team is also entered in the exact same format. Even better is that you can change the year at the end of the URL to any year after 2002 and you will find a wealth of similarly formatted data. This meets the two criteria: 1.) URL's containing data have extremely similar formats and 2.) Each webpage has identical HTML formatting. This is the type of data you should look for when finding data for your project as it will make your life a whole lot easier!
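-
-For instance, the year-in-the-URL pattern described above could be generated like this (a sketch only; the XPath for the table rows would still need to be worked out, just as in the IMDB and NPS examples):
-
-[source,R]
-----
-# build the kenpom URLs by changing only the year at the end of the URL
-myyears <- 2003:2019
-myurls <- paste0("https://kenpom.com/index.php?y=", myyears)
-head(myurls)
-----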
-
-Also keep in mind that you want to get data that you can analyze in some way for project 25!
-
-Sorry for being long, but I hope this was helpful and inspiring. Remember to work smarter, not harder!
-
-== Project 25
-
-Question 1
-
-a. Store the 100,000 pieces of data that you scraped in Project 24 into a data frame.
-
-b. Save that data frame in an xlsx file, for instance, using the write.xlsx function from the library "xlsx".
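-
-Here is a minimal sketch for 1b, assuming your scraped data already sits in a data frame called myDF (the file name is just a placeholder):
-
-[source,R]
-----
-install.packages("xlsx")
-library(xlsx)
-
-# write the data frame to an xlsx file in your home directory
-write.xlsx(myDF, "~/myscrapeddata.xlsx", row.names=FALSE)
-----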
-
-Question 2
-
-2a,2b,2c. Make 3 questions about the data that you assembled in Project 24.
-
-Question 3
-
-3a,3b,3c. Answer the 3 questions from 2a,2b,2c by making 3 visualizations from the data that you assembled. Be sure to use best practices for data visualization.
-
-Refer to the selections from the texts:
-
-The Elements of Graphing Data by William S. Cleveland
-
-and Creating More Effective Graphs by Naomi B. Robbins
-
-These selections are archived online here:
-
-http://llc.stat.purdue.edu/ElementsOfGraphingData.pdf
-
-http://llc.stat.purdue.edu/CreatingMoreEffectiveGraphs.pdf
-
-Submit your project in RMarkdown. Please be sure to submit the .Rmd file and also the .xlsx file created in 1b. Of course the graders will be unable to run your code for 1a, because they do not want to scrape all of the data that you scraped. Instead, the graders want to use the data from question 1b, so be sure to submit the .Rmd file and the .xlsx file too.
-
-== Optional Project 1
-
-Remind yourself how to run SQL queries in R, for instance, using the examples from Project 8.
-
-Question 1
-
-Find the largest number of home runs (by an individual batter) each year.
-
-For instance:
-
-in 2014 a player hit 40 HR's,
-
-in 2015 a player hit 47 HR's,
-
-in 2016 a player hit 47 HR's,
-
-in 2017 a player hit 59 HR's, and
-
-in 2018 a player hit 48 HR's.
-
-(Yes, I have updated the data to include 2018!!)
-
-Question 2
-
-Make a plot that shows this largest number of home runs per year (not just these 5 years, but the annual records back to 1871).
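-
-Here is a minimal sketch for Questions 1 and 2, assuming that a connection object called conn to the baseball database has already been created (for instance, following the Project 8 examples), and that the table and column names follow the standard Lahman schema (Batting, yearID, HR); please adjust these assumptions to match your own setup.
-
-[source,R]
-----
-library(DBI)
-
-# largest number of home runs by an individual batter, for each year
-myHRs <- dbGetQuery(conn,
-  "SELECT yearID, MAX(HR) AS maxHR FROM Batting GROUP BY yearID ORDER BY yearID")
-
-# plot the annual record, back to 1871
-plot(myHRs$yearID, myHRs$maxHR, type="l",
-  xlab="year", ylab="most home runs by one batter")
-----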
-
-Question 3
-
-Create a question about baseball that you are interested in, and use a SQL query in R to answer the question. Put all of your R code into an RMarkdown file, and give some comments about your code, to explain your method of solution. Submit the RMarkdown (.Rmd) file, and also a pdf file that shows the output (including the code, your explanation, the picture from question 2 that displays the plot, etc.).
-
-== Optional Project 2
-
-Recall how we can work with very large data sets (which are too large to import into R), by using UNIX. We did this in some of the earliest problem sets in STAT 19000, during the fall semester.
-
-Question 1
-
-a. How many taxi cab rides occurred (altogether) during 2015? Do not give a breakdown by month. Give the total number of taxi cab rides for the full year 2015. (Hint: Remember to be careful about the headers at the top of each file.)
-
-b. Give the distribution of the number of passengers in the taxi cab rides throughout (all months of) the year 2015. Do not give a breakdown by month. Give the distribution across the full year 2015.
-
-Question 2
-
-a. Across all years of the airline data, how many flights occurred on each airline? Which airline is the most popular overall, in terms of the number of flights?
-
-b. Across all years of the airline data, which flight path is the most popular? How many airplane trips occurred on that flight path?
-
-Question 3
-
-Create a question about taxi cab rides or airline flights that you are interested in, and use UNIX to answer the question.
-
-Put all of your UNIX code into a plain text file, and give some comments about your code, to explain your method of solution. Submit the plain text (.txt) file with your code (including your explanations).
-
-== Optional Project 3
-
-Use R to analyze the election data from the 2018 election. Remember to use read.csv to read in the data, and use header=F (since there is no header) and use sep="|" since this symbol separates the data.
-
-Question 1
-
-a. Identify the top 20 employers that donated the most money (altogether). Some of these entries will be strange, e.g., blank entries, NA, self employed, etc. That is OK!
-
-b. Plot the largest 20 total amounts (from the 20 employers) on a dotchart, in order from largest (at the top) to smallest (at the bottom).
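-
-Here is a minimal sketch for 1a and 1b; the employer column number is an assumption that you should verify against the FEC data dictionary.
-
-[source,R]
-----
-# the employer is assumed to be in column 12 of the election data
-myDF <- read.csv("/depot/statclass/data/election2018/itcont.txt", header=F, sep="|")
-top20 <- tail(sort(tapply(myDF$V15, myDF$V12, sum, na.rm=T)), n=20)
-top20
-
-# dotchart plots the first element at the bottom, so the ascending sort
-# already puts the largest total at the top
-dotchart(top20, xlab="total amount donated")
-----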
-
-Question 2
-
-a. In which city/state is the average donation amount the largest? (Treat the city and state data together as a pair.)
-
-b. How many donations were given from this city/state pair? How large was the total amount of donations from this city/state pair?
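-
-Here is a minimal sketch for 2a and 2b, reusing the paste(city, state) idea from Project 18, and assuming myDF was read in as in the sketch for Question 1:
-
-[source,R]
-----
-# average donation per city/state pair (city in column 9, state in column 10)
-mymeans <- tapply(myDF$V15, paste(myDF$V9, myDF$V10), mean, na.rm=T)
-
-# 2a. the city/state pair with the largest average donation
-tail(sort(mymeans), n=1)
-
-# 2b. the number of donations and the total amount from that pair
-mypair <- names(tail(sort(mymeans), n=1))
-sum(paste(myDF$V9, myDF$V10) == mypair)
-sum(myDF$V15[paste(myDF$V9, myDF$V10) == mypair], na.rm=T)
-----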
-
-Question 3
-
-Create a question about the 2018 election data that you are interested in, and use R to answer the question.
-
-Put all of your R code into an RMarkdown file, and give some comments about your code, to explain your method of solution. Submit the RMarkdown (.Rmd) file, and also a pdf file that shows the output (including the code, your explanation, the picture from question 1b that displays the plot, etc.).
-
-== Optional Project 4
-
-Please submit your project in RMarkdown.
-
-Read the selection of The Elements of Graphing Data by William Cleveland, and the selection of Creating More Effective Graphs by Naomi Robbins.
-
-Also read the classic article "How to Display Data Badly" by Howard Wainer:
-
-http://www.jstor.org.ezproxy.lib.purdue.edu/stable/2683253
-
-We referred to both of these in Project 25.
-
-=== Question 1
-
-a. Find 3 visualizations from the Information Is Beautiful website (http://www.informationisbeautiful.net/) that do a BAD job of portraying data, according to the best practices in the selections mentioned above. Write 1/3 of a page (for each such visualization) about what is done poorly, i.e., write 1 single-spaced page total.
-
-b. Identify 3 excellent visualizations of data from the Information Is Beautiful website. Write 1/3 of a page (for each such visualization) about what is done well, i.e., write 1 single-spaced page total.
-
-=== Question 2
-
-Consider the poster winner "Congestion in the Sky", from the 2009 Data Expo: http://stat-computing.org/dataexpo/2009/posters/
-
-a. Describe at least 3 significant ways that this poster could be improved. For each of these 3 ways, write a one-third-page constructive criticism, specifying what could be improved and how that aspect of the visualization could be done better, i.e., write 1 single-spaced page total.
-
-b. Which of the posters in the Data Expo 2009 do you think should be the winner? Why? (It is OK if you choose the poster that actually won, or any of the other posters.) Thoroughly justify your answer using the techniques of effective data visualization (write 1 single-spaced page total).
-
-(This entire assignment is 4 single-spaced pages.)
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project01.adoc
deleted file mode 100644
index 1d69e1b48..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project01.adoc
+++ /dev/null
@@ -1,169 +0,0 @@
-= STAT 19000: Project 1 -- Fall 2020
-
-**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called RStudio, we will take some time to learn how to connect to it, configure it, and run code.
-
-**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data!
-
-**Scope:** r, rstudio, Scholar
-
-.Learning Objectives
-****
-- Use Jupyter Notebook to run Python code and create Markdown text.
-- Use RStudio to run Python code and compile your final PDF.
-- Gain exposure to Python control flow and reading external data.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/open_food_facts/openfoodfacts.tsv`
-
-== Questions
-
-=== Question 1
-
-Navigate to https://notebook.scholar.rcac.purdue.edu/ and sign in with your Purdue credentials (_without_ BoilerKey). This is an instance of Jupyter Notebook. The main screen will show a series of files and folders that are in your `$HOME` directory. Create a new notebook by clicking on menu:New[f2020-s2021].
-
-Change the name of your notebook to "LASTNAME_FIRSTNAME_project01" where "LASTNAME" is your family name, and "FIRSTNAME" is your given name. Try to export your notebook using the menu:File[Download as] menu. What format options (for example, `.pdf`) are available to you?
-
-[NOTE]
-`f2020-s2021` is the name of our course notebook kernel. A notebook kernel is an engine that runs code in a notebook. ipython kernels run Python code. `f2020-s2021` is an ipython kernel that we've created for our course Python environment, which contains a variety of compatible, pre-installed packages for you to use. When you select `f2020-s2021` as your kernel, all of the packages in our course environment are automatically made available to you.
-
-https://mediaspace.itap.purdue.edu/id/1_4g2lwx5g[Click here for video]
-
-.Items to submit
-====
-- A list of export format options.
-====
-
-=== Question 2
-
-Each "box" in a Jupyter Notebook is called a _cell_. There are two primary types of cells: code, and markdown. By default, a cell will be a code cell. Place the following Python code inside the first cell, and run the cell. What is the output?
-
-[source,python]
-----
-from thedatamine import hello_datamine
-hello_datamine()
-----
-
-[TIP]
-You can run the code in the currently selected cell by using the GUI (the buttons), as well as by pressing kbd:[Ctrl+Enter] or kbd:[Ctrl+Return].
-
-.Items to submit
-====
-- Output from running the provided code.
-====
-
-=== Question 3
-
-Jupyter Notebooks allow you to easily pull up documentation, similar to `?function` in R. To do so, use the `help` function, like this: `help(my_function)`. What is the output from running the help function on `hello_datamine`? Can you modify the code from question (2) to print a customized message? Create a new _markdown_ cell and explain what you did to the code from question (2) to make the message customized.
-
-[IMPORTANT]
-====
-Some Jupyter-only methods to do this are:
-
-- Click on the function of interest and type kbd:[Shift+Tab] or kbd:[Shift+Tab+Tab].
-- Run `function?`, for example, `print?`.
-====
-
-[IMPORTANT]
-You can also see the source code of a function in a Jupyter Notebook by typing `function??`, for example, `print??`.
-
-.Items to submit
-====
-- Output from running the `help` function on `hello_datamine`.
-- Modified code from question (2) that prints a customized message.
-====
-
-=== Question 4
-
-At this point, you've got the basics of running Python code in Jupyter Notebooks. There is really not a whole lot more to it. For this class, however, we will continue to create RMarkdown documents in addition to the compiled PDFs. You are welcome to use Jupyter Notebooks for personal projects or for testing things out; however, we will still require an RMarkdown file (.Rmd), a PDF (generated from the RMarkdown file), and a .py file (containing your Python code). For example, please move your solutions from Questions 1, 2, and 3 from Jupyter Notebooks over to RMarkdown (we discuss RMarkdown below). Let's learn how to run Python code chunks in RMarkdown.
-
-Sign in to https://rstudio.scholar.rcac.purdue.edu (_with_ BoilerKey). Projects in The Data Mine should all be submitted using our template found https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/project_template.Rmd[here] or on Scholar (`/class/datamine/apps/templates/project_template.Rmd`).
-
-Open the project template and save it into your home directory, in a new RMarkdown file named `project01.Rmd`. Prior to running any Python code, run `datamine_py()` in the R console, just like you did at the beginning of every project from the first semester.
-
-Code chunks are parts of the RMarkdown file that contain code. You can identify what type of code a code chunk contains by looking at the _engine_ in the curly braces "{" and "}". As you can see, it is possible to mix and match different languages just by changing the engine. Move the solutions for questions 1-3 to your `project01.Rmd`. Make sure to place all Python code in `python` code chunks. Run the `python` code chunks to ensure you get the same results as you got when running the Python code in a Jupyter Notebook.
-
-[NOTE]
-Make sure to run `datamine_py()` in the R console prior to attempting to run any Python code.
-
-[TIP]
-The end result of the `project01.Rmd` should look _similar_ to https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/example02.Rmd[this].
-
-https://mediaspace.itap.purdue.edu/id/1_nhkygxg9[Click here for video]
-
-https://mediaspace.itap.purdue.edu/id/1_tdz3wmim[Click here for video]
-
-.Items to submit
-====
-- `project01.Rmd` with the solutions from questions 1-3 (including any Python code in `python` code chunks).
-====
-
-=== Question 5
-
-It is not a Data Mine project without data! [Here] are some examples of reading in data line by line using the `csv` package. How many columns are in the following dataset: `/class/datamine/data/open_food_facts/openfoodfacts.tsv`? Print the first row, the number of columns, and then exit the loop after the first iteration using the `break` keyword.
-
-[TIP]
-You can get the number of elements in a list by using the `len` method. For example: `len(my_list)`.
-
-[TIP]
-You can use the `break` keyword to exit a loop. As soon as `break` is executed, the loop is exited and the code immediately following the loop is run.
-
-[source,python]
-----
-for my_row in my_csv_reader:
- print(my_row)
- break
-print("Exited loop as soon as 'break' was run.")
-----
-
-[TIP]
-`'\t'` represents a tab in Python.
-
-https://mediaspace.itap.purdue.edu/id/1_ck74xlzq[Click here for video]
-
-[IMPORTANT]
-If you get a Dtype warning, feel free to just ignore it.
-
-**Relevant topics:** for loops, break, print
-
-.Items to submit
-====
-- Python code used to solve this problem.
-- The first row printed, and the number of columns printed.
-====
-
-=== Question 6 (optional)
-
-Unlike in R, where many of the tools you need are built-in (`read.csv`, data.frames, etc.), in Python, you will need to rely on packages like `numpy` and `pandas` to do the bulk of your data science work.
-
-In R it would be really easy to find the mean of the 151st column, `caffeine_100g`:
-
-[source,r]
-----
-myDF <- read.csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t", quote="")
-mean(myDF$caffeine_100g, na.rm=T) # 2.075503
-----
-
-If you were to try to modify our loop from question (5) to do the same thing, you will run into a myriad of issues, just to try and get the mean of a column. Luckily, it is easy to do using `pandas`:
-
-[source,python]
-----
-import pandas as pd
-myDF = pd.read_csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t")
-myDF["caffeine_100g"].mean() # 2.0755028571428573
-----
-
-Take a look at some of the methods you can perform using pandas https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats[here]. Perform an interesting calculation in R, and replicate your work using `pandas`. Which did you prefer, Python or R?
-
-https://mediaspace.itap.purdue.edu/id/1_ybx1iukd[Click here for video]
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Python code used to solve the problem.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project02.adoc
deleted file mode 100644
index f37e3bca0..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project02.adoc
+++ /dev/null
@@ -1,142 +0,0 @@
-= STAT 19000: Project 2 -- Fall 2020
-
-*Introduction to R using 84.51 examples*
-
-++++
-
-++++
-
-*Introduction to R using NYC Yellow Taxi Cab examples*
-
-++++
-
-++++
-
-**Motivation:** The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn the basics while solving data-driven problems.
-
-**Context:** Last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within RStudio to solve problems. Soon you will see how powerful R is and why it is often a more effective tool to use than spreadsheets.
-
-**Scope:** r, vectors, indexing, recycling
-
-.Learning Objectives
-****
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Read and write basic (csv) data.
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Identify good and bad aspects of simple plots.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/disney/metadata.csv`
-
-== Questions
-
-=== Question 1
-
-Use the `read.csv` function to load `/class/datamine/data/disney/metadata.csv` into a `data.frame` called `myDF`. Note that `read.csv` _by default_ loads data into a `data.frame`. (We will learn more about the idea of a `data.frame`, but for now, just think of it like a spreadsheet, in which each column has the same type of data.) Print the first few rows of `myDF` using the `head` function (as in Project 1, Question 7).
-
-.Items to submit
-====
-- R code used to solve the problem in an R code chunk.
-====
-
-=== Question 2
-
-We've provided you with R code below that will extract the column `WDWMAXTEMP` of `myDF` into a vector. What is the 1st value in the vector? What is the 50th value in the vector? What type of data is in the vector? (For this last question, use the `typeof` function to find the type of data.)
-
-[source,r]
-----
-our_vec <- myDF$WDWMAXTEMP
-----
-
-.Items to submit
-====
-- R code used to solve the problem in an R code chunk.
-- The values of the first, and 50th element in the vector.
-- The type of data in the vector (using the `typeof` function).
-====
-
-=== Question 3
-
-Use the `head` function to create a vector called `first50` that contains the first 50 values of the vector `our_vec`. Use the `tail` function to create a vector called `last50` that contains the last 50 values of the vector `our_vec`.
-
-You can access many elements in a vector at the same time. To demonstrate this, create a vector called `mymix` that contains the sum of each element of `first50` added to the corresponding element of `last50`.
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The contents of each of the three vectors.
-====
-
-=== Question 4
-
-In (3), we were able to rapidly add values together from two different vectors. Both vectors were the same size, hence, it was obvious which elements in each vector were added together.
-
-Create a new vector called `hot` which contains only the values of `myDF$WDWMAXTEMP` which are greater than or equal to 80 (our vector contains max temperatures for days at Disney World). How many elements are in `hot`?
-
-Calculate the sum of `hot` and `first50`. Do we get a warning? Read https://excelkingdom.blogspot.com/2018/01/what-recycling-of-vector-elements-in-r.html[this] and then explain what is going on.
-
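-Here is a small, generic demonstration of recycling (separate from the question, just to illustrate the idea):
-
-[source,r]
------
-x <- 1:6
-y <- c(10, 20)
-
-# y is recycled to match the length of x: 11 22 13 24 15 26 (no warning, because 6 is a multiple of 2)
-x + y
-
-# here the lengths are not multiples of each other, so R recycles and also issues a warning
-x + c(1, 2, 3, 4)
------
-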
-.Items to submit
-====
-- R code used to solve this problem.
-- 1-2 sentences explaining what is happening when we are adding two vectors of different lengths.
-====
-
-=== Question 5
-
-Plot the `WDWMAXTEMP` vector from `myDF`.
-
-.Items to submit
-====
-- R code used to solve this problem.
-- Plot of the `WDWMAXTEMP` vector from `myDF`.
-====
-
-=== Question 6
-
-The following three pieces of code each create a graphic. The first two graphics are created using only core R functions. The third graphic is created using a package called `ggplot`. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any).
-
-[source,r]
-----
-dat <- table(myDF$SEASON)
-dotchart(dat, main="Seasons", xlab="Number of Days in Each Season")
-----
-
-image:stat19000project2figure1.png["A plot resembling an abacus, where holiday is listed in a vertical list and the corresponding number of days are on that horizontal line, the further right indicating more days for that holiday classification. The clear winner is Spring.", loading=lazy]
-
-[source,r]
-----
-dat <- tapply(myDF$WDWMEANTEMP, myDF$DAYOFYEAR, mean, na.rm=T)
-seasons <- tapply(myDF$SEASON, myDF$DAYOFYEAR, function(x) unique(x)[1])
-pal <- c("#4E79A7", "#F28E2B", "#A0CBE8", "#FFBE7D", "#59A14F", "#8CD17D", "#B6992D", "#F1CE63", "#499894", "#86BCB6", "#E15759", "#FF9D9A", "#79706E", "#BAB0AC", "#1170aa", "#B07AA1")
-colors <- factor(seasons)
-levels(colors) <- pal
-par(oma=c(7,0,0,0), xpd=NA)
-barplot(dat, main="Average Temperature", xlab="Jan 1 (Day 0) - Dec 31 (Day 365)", ylab="Degrees in Fahrenheit", col=as.factor(colors), border = NA, space=0)
-legend(0, -30, legend=levels(factor(seasons)), lwd=5, col=pal, ncol=3, cex=0.8, box.col=NA)
-----
-
-image:stat19000project2figure2.png["A filled line plot with colors corresponding to the predominant holiday at the time.", loading=lazy]
-
-[source,r]
-----
-library(ggplot2)
-library(tidyverse)
-summary_temperatures <- myDF %>%
- select(MONTHOFYEAR,WDWMAXTEMP:WDWMEANTEMP) %>%
- group_by(MONTHOFYEAR) %>%
- summarise_all(mean, na.rm=T)
-ggplot(summary_temperatures, aes(x=MONTHOFYEAR)) +
- geom_ribbon(aes(ymin = WDWMINTEMP, ymax = WDWMAXTEMP), fill = "#ceb888", alpha=.5) +
- geom_line(aes(y = WDWMEANTEMP), col="#5D8AA8") +
- geom_point(aes(y = WDWMEANTEMP), pch=21,fill = "#5D8AA8", size=2) +
- theme_classic() +
- labs(x = 'Month', y = 'Temperature', title = 'Average temperature range' ) +
- scale_x_continuous(breaks=1:12, labels=month.abb)
-----
-
-image:stat19000project2figure3.png["Line plot of temperatures over months including the range. Displays a very clear arch, highs in July-August at an average of 82 degrees Fahrenheit and lows at an average of 52 degrees Fahrenheit in January.", loading=lazy]
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project03.adoc
deleted file mode 100644
index 329ed3e4b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project03.adoc
+++ /dev/null
@@ -1,125 +0,0 @@
-= STAT 19000: Project 3 -- Fall 2020
-
-**Motivation:** `data.frame`s are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.
-
-**Context:** In the previous project we got our feet wet, and ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called `data.frame`s.
-
-**Scope:** r, data.frames, recycling, factors
-
-.Learning Objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/disney`
-
-== Questions
-
-=== Question 1
-
-Read the dataset `/class/datamine/data/disney/splash_mountain.csv` into a data.frame called `splash_mountain`. How many columns, or features, are in this dataset? How many rows, or observations?
-
-.Items to submit
-====
-- R code used to solve the problem.
-- How many columns or features are in the dataset?
-====
-
-=== Question 2
-
-Splash Mountain is a fan favorite ride at Disney World's Magic Kingdom theme park. `splash_mountain` contains a series of dates and datetimes. For each datetime, `splash_mountain` contains a posted minimum wait time, `SPOSTMIN`, and an actual minimum wait time, `SACTMIN`. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Based on the fact that `SPOSTMIN` represents the posted minimum wait time for our ride, do our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.)
-
-[TIP]
-====
-If you got `NA` or `NaN` as a result, see xref:programming-languages:R:mean.adoc[here].
-====
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining why or why not the results make sense.
-====
-
-=== Question 3
-
-In (2), we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file `/class/datamine/data/disney/touringplans_data_dictionary.xlsx`, you will find that -999 is used as a value in `SPOSTMIN` and `SACTMIN` to indicate that the ride is closed. Recalculate the mean and standard deviation of `SPOSTMIN`, excluding values that are -999. Does this seem to have fixed our problem?
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The result of running the R code.
-- A statement indicating whether or not the values look reasonable now.
-====
-
-=== Question 4
-
-`SPOSTMIN` and `SACTMIN` aren't the greatest feature/column names. An outsider looking at the data.frame wouldn't be able to immediately get the gist of what they represent. Change `SPOSTMIN` to `posted_min_wait_time` and `SACTMIN` to `actual_wait_time`.
-
-**Hint:** You can always use hard-coded integers to change names manually, however, if you use `which`, you can get the index of the column name that you would like to change. For data.frames like `splash_mountain`, this is a lot more efficient than manually counting which column is the one with a certain name.
-
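-For instance, the general renaming pattern looks something like this (`myDF`, `old_name`, and `new_name` are placeholders, not the actual names for this question):
-
-[source,r]
------
-# find the position of the column named "old_name", then overwrite that name
-names(myDF)[which(names(myDF) == "old_name")] <- "new_name"
------
-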
-.Items to submit
-====
-- R code used to solve the problem.
-- The output from executing `names(splash_mountain)` or `colnames(splash_mountain)`.
-====
-
-=== Question 5
-
-Use the `cut` function to create a new vector called `quarter` that breaks the `date` column up by quarter. Use the `labels` argument in the `factor` function to label the quarters "q1", "q2", ..., "qX" where `X` is the last quarter. Add `quarter` as a column named `quarter` in `splash_mountain`. How many quarters are there?
-
-[TIP]
-====
-If you have 2 years of data, this will result in 8 quarters: "q1", ..., "q8".
-====
-
-[TIP]
-====
-We can generate sequential data using `seq` and `paste0`:
-
-[source,r]
-----
-paste0("item", seq(1, 5))
-----
-
-or
-
-[source,r]
-----
-paste0("item", 1:5)
-----
-====
-
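-A hedged sketch of one possible approach (the date format used below is an assumption; check the actual format of the `date` column in the file before using it):
-
-[source,r]
------
-# parse the dates, then bin them by calendar quarter
-dates <- as.Date(splash_mountain$date, format="%m/%d/%Y")
-quarter <- cut(dates, breaks="quarter")
-
-# relabel the quarters as "q1", "q2", ..., "qX"
-levels(quarter) <- paste0("q", seq_along(levels(quarter)))
-splash_mountain$quarter <- quarter
------
-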
-.Items to submit
-====
-- R code used to solve the problem.
-- The `head` and `tail` of `splash_mountain`.
-- The number of quarters in the new `quarter` column.
-====
-
-Question 5 is intended to be a little more challenging, so we worked through the _exact_ same steps with two other data sets. That way, all you will need to do to solve Question 5 is follow the example and change two things: the data set itself (in the `read.csv` call) and the format of the date.
-
-This basically steps you through _everything_ in Question 5.
-
-We hope that these are helpful resources for you! We appreciate you very much and we are here to support you! You would not be expected to solve this question on your own, because we are just getting started, but we like to occasionally include a question like this, in which you get introduced to several new things, and we will dive deeper into these ideas as we push ahead.
-
-++++
-
-++++
-
-++++
-
-++++
-
-=== Question 6
-
-Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project04.adoc
deleted file mode 100644
index 4ba1177d9..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project04.adoc
+++ /dev/null
@@ -1,168 +0,0 @@
-= STAT 19000: Project 4 -- Fall 2020
-
-**Motivation:** Control flow is (roughly) the order in which instructions are executed. We can execute certain tasks or code _if_ certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.
-
-**Context:** We are gaining familiarity working in RStudio and writing R code. In this project we introduce and practice using control flow in R.
-
-**Scope:** r, data.frames, recycling, factors, if/else, for
-
-.Learning objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/disney`
-
-== Questions
-
-=== Question 1
-
-Use `read.csv` to read in the `/class/datamine/data/disney/splash_mountain.csv` data into a `data.frame` called `splash_mountain`. In the previous project we calculated the mean and standard deviation of the `SPOSTMIN` (posted minimum wait time). These are vectorized operations (we will learn more about this next project). Instead of using the `mean` function, use a loop to calculate the mean (average), just like the previous project. Do not use `sum` either.
-
-[TIP]
-====
-Remember, if a value is NA, we don't want to include it.
-====
-
-[TIP]
-====
-Remember, if a value is -999, it means the ride is closed, we don't want to include it.
-====
-
-[NOTE]
-====
-This exercise should make you appreciate the variety of useful functions R has to offer!
-====
-
-.Items to submit
-====
-- R code used to solve the problem w/comments explaining what the code does.
-- The mean posted wait time.
-====
-
-=== Question 2
-
-Choose one of the `.csv` files containing data for a ride. Use `read.csv` to load the file into a data.frame named `ride_name` where "ride_name" is the name of the ride you chose. Use a for loop to loop through the ride file and add a new column called `status`. `status` should contain a string whose value is either "open", or "closed". If `SPOSTMIN` or `SACTMIN` is -999, classify the row as "closed". Otherwise, classify the row as "open". After `status` is added to your data.frame, convert the column to a `factor`.
-
-[TIP]
-====
-If you want to access two columns at once from a data.frame, you can do: `splash_mountain[i, c("SPOSTMIN", "SACTMIN")]`.
-====
-
-[NOTE]
-====
-For loops are often <<r-for-loops-versus-vectorized-functions,much slower (here is a video to demonstrate)>> than vectorized functions, as we will see in (3) below.
-====
-
-.Items to submit
-====
-- R code used to solve the problem w/comments explaining what the code does.
-- The output from running `str` on `ride_name`.
-====
-
-In this video, we basically go all the way through Question 2 using a video:
-
-++++
-
-++++
-
-=== Question 3
-
-Typically you want to avoid using for loops (or even apply functions, which we will learn more about later on, don't worry) when they aren't needed. Instead you can use vectorized operations and indexing. Repeat (2) without using any for loops or apply functions (instead use indexing and the `which` function). Which method was faster?
-
-[TIP]
-====
-To have multiple conditions within the `which` statement, use `|` for logical OR and `&` for logical AND.
-====
-
-[TIP]
-====
-You can start by assigning every value in `status` as "open", and then change the correct values to "closed".
-====
-
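-In code, that general pattern looks something like the following sketch (`myDF`, `colA`, and `colB` are placeholders; adapt the data.frame and column names to this question):
-
-[source,r]
------
-# start with every row labeled "open"
-myDF$status <- "open"
-
-# overwrite the rows that meet either condition
-myDF$status[which(myDF$colA == -999 | myDF$colB == -999)] <- "closed"
-
-# convert the new column to a factor
-myDF$status <- factor(myDF$status)
------
-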
-[NOTE]
-====
-Here is a <<r-example-safe-versus-contaminated,complete example (very much like question 3) with another video>> that shows how we can classify objects.
-====
-
-[NOTE]
-====
-Here is a <<r-example-for-loops-compared-to-vectorized-functions,complete example with a video>> that makes a comparison between the concept of a for loop and the concept of a vectorized function.
-====
-
-.Items to submit
-====
-- R code used to solve the problem w/comments explaining what the code does.
-- The output from running `str` on `ride_name`.
-====
-
-=== Question 4
-
-Create a pie chart for open vs. closed for `splash_mountain.csv`. First, use the `table` command to get a count of each `status`. Use the resulting table as input to the `pie` function. Make sure to give your pie chart a title that somehow indicates the ride to the audience.
-
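-A minimal sketch of that pattern (the title text is just an example):
-
-[source,r]
------
-# count how many rows fall into each status, then plot the counts as a pie chart
-counts <- table(splash_mountain$status)
-pie(counts, main="Splash Mountain: open vs. closed")
------
-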
-.Items to submit
-====
-- R code used to solve the problem w/comments explaining what the code does.
-- The resulting plot displayed as output in the RMarkdown.
-====
-
-=== Question 5
-
-Loop through the vector of files we've provided below, and create a pie chart of open vs closed for each ride. Place all 6 resulting pie charts on the same image. Make sure to give each pie chart a title that somehow indicates the ride.
-
-[source,r]
-----
-ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "expedition_everest", "flight_of_passage", "rock_n_rollercoaster")
-ride_files <- paste0("/class/datamine/data/disney/", ride_names, ".csv")
-----
-
-[TIP]
-====
-To place all of the resulting pie charts in the same image, prior to running the for loop, run `par(mfrow=c(2,3))`.
-====
-
-This is not exactly the same, but it is a similar example, using the campaign election data:
-
-[source,r]
-----
-mypiechart <- function(x) {
- myDF <- read.csv( paste0("/class/datamine/data/election/itcont", x, ".txt"), sep="|")
- mystate <- rep("other", times=nrow(myDF))
- mystate[myDF$STATE == "CA"] <- "California"
- mystate[myDF$STATE == "TX"] <- "Texas"
- mystate[myDF$STATE == "NY"] <- "New York"
- myDF$stateclassification <- factor(mystate)
- pie(table(myDF$stateclassification))
-}
-myyears <- c("1980","1984","1988","1992","1996","2000")
-par(mfrow=c(2,3))
-for (i in myyears) {
- mypiechart(i)
-}
-----
-
-++++
-
-++++
-
-Here is another video, which guides students even more closely through Question 5.
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem w/comments explaining what the code does.
-- The resulting plot displayed as output in the RMarkdown.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project05.adoc
deleted file mode 100644
index 24a04054f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project05.adoc
+++ /dev/null
@@ -1,152 +0,0 @@
-= STAT 19000: Project 5 -- Fall 2020
-
-**Motivation:** As briefly mentioned in project 4, R differs from other programming languages in that _typically_ you will want to avoid using for loops, and instead use vectorized functions and the apply suite. In this project we will demonstrate some basic vectorized operations, and how they are better to use than loops.
-
-**Context:** While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.
-
-**Scope:** r, data.frames, recycling, factors, if/else, for
-
-.Learning objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: for loops.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/fars`
-
-To get more information on the dataset, see https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812602[here].
-
-== Questions
-
-=== Question 1
-
-The `fars` dataset contains a series of folders labeled by year. In each year folder there are (at least) the files `ACCIDENT.CSV`, `PERSON.CSV`, and `VEHICLE.CSV`. If you take a peek at any `ACCIDENT.CSV` file in any year, you'll notice that the column `YEAR` only contains the last two digits of the year. Use the `rbind` function to create a data.frame called `accidents` that combines the `ACCIDENT.CSV` files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that `accidents` data frame, change the values in the `YEAR` column from two digits to the _full_ four-digit year (i.e., paste a 19 onto each year value).
-
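-A hedged sketch of one way to structure this (the folder layout `/class/datamine/data/fars/<year>/ACCIDENT.CSV` is assumed from the description above; verify the exact paths on Scholar):
-
-[source,r]
------
-accidents <- NULL
-for (yr in 1975:1981) {
-  # read one year's accident file and stack it onto what we have so far
-  one_year <- read.csv(paste0("/class/datamine/data/fars/", yr, "/ACCIDENT.CSV"))
-  accidents <- rbind(accidents, one_year)
-}
-
-# change the two-digit years into full four-digit years
-accidents$YEAR <- paste0("19", accidents$YEAR)
------
-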
-Here is a video to walk you through the method of solving Question 1.
-
-++++
-
-++++
-
-Here is another video, using two functions you have not (yet) learned, namely, `lapply` and `do.call`. You do **not** need to understand these yet. _It is just a glimpse of some powerful functions to come later in the course!_
-
-++++
-
-++++
-
-
-.Items to submit
-====
-- R code used to solve the problem/comments explaining what the code does.
-- The result of `unique(accidents$YEAR)`.
-====
-
-=== Question 2
-
-Using the new `accidents` data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus?
-
-[TIP]
-====
-Look at the variables `DRUNK_DR` and `SCH_BUS`.
-====
-
-Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers).
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem/comments explaining what the code does.
-- The result/answer itself.
-====
-
-=== Question 3
-
-Again using the `accidents` data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?
-
-Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year.
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem/comments explaining what the code does.
-- The results.
-- Which year had the most qualifying accidents.
-====
-
-=== Question 4
-
-Again using the `accidents` data frame: Calculate the mean number of motorists involved in an accident (variable `PERSON`) with i drunk drivers, where i takes the values from 0 through 6.
-
-[TIP]
-====
-It is OK that there are no accidents involving just 5 drunk drivers.
-====
-
-[TIP]
-====
-You can use either a `for` loop or a `tapply` function to accomplish this question.
-====
-
-Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with `i` drunk drivers, where `i` takes the values from 0 through 6.
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem/comments explaining what the code does.
-- The output from running your code.
-====
-
-=== Question 5
-
-Again using the `accidents` data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which `STATE` is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by `STATE` and by month (`MONTH`) simultaneously. What months have the most accidents? Are you surprised by these results? Explain why or why not.
-
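-One possible sketch of the core steps (a plain stacked barplot of counts; sprucing it up is left for the optional question below):
-
-[source,r]
------
-# keep only Illinois (17), Indiana (18), Michigan (26), and Ohio (39)
-nearby <- accidents[accidents$STATE %in% c(17, 18, 26, 39), ]
-
-# rows are states, columns are months; barplot stacks the states within each month
-counts <- table(nearby$STATE, nearby$MONTH)
-barplot(counts, xlab="Month (1-12)", ylab="Number of accidents")
------
-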
-We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States.
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem/comments explaining what the code does.
-- The output (plot) from running your code.
-- 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you.
-====
-
-=== OPTIONAL QUESTION
-
-Spruce up your plot from (5). Do any of the following:
-
-- Add vibrant (and preferably colorblind friendly) colors to your plot
-- Add a title
-- Add a legend
-- Add month names or abbreviations instead of numbers
-
-[TIP]
-====
-https://www.r-graph-gallery.com/209-the-options-of-barplot.html[Here] is a resource to get you started.
-====
-
-.Items to submit
-====
-- R code used to solve the problem/comments explaining what the code does.
-- The output (plot) from running your code.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project06.adoc
deleted file mode 100644
index 149f198d1..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project06.adoc
+++ /dev/null
@@ -1,271 +0,0 @@
-= STAT 19000: Project 6 -- Fall 2020
-
-The `tapply` function works like this:
-
-`tapply( somedata, thewaythedataisgrouped, myfunction)`
-
-[source,r]
-----
-myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
-head(myDF)
-----
-
-We could do four computations to compute the `mean` `SPEND` amount in each `STORE_R`...
-
-[source,r]
-----
-mean(myDF$SPEND[myDF$STORE_R == "CENTRAL"])
-mean(myDF$SPEND[myDF$STORE_R == "EAST "])
-mean(myDF$SPEND[myDF$STORE_R == "SOUTH "])
-mean(myDF$SPEND[myDF$STORE_R == "WEST "])
-----
-
-...but it is easier to do all four of these calculations with the `tapply` function. We take a `mean` of the `SPEND` values, broken into groups according to the `STORE_R`:
-
-[source,r]
-----
-tapply( myDF$SPEND, myDF$STORE_R, mean)
-----
-
-We could find the total amount in the `SPEND` column in 2016 and then again in 2017...
-
-[source,r]
-----
-sum(myDF$SPEND[myDF$YEAR == "2016"])
-sum(myDF$SPEND[myDF$YEAR == "2017"])
-----
-
-...or we could do both of these calculations at once, using the `tapply` function. We take the `sum` of all `SPEND` amounts, broken into groups according to the `YEAR`:
-
-[source,r]
-----
-tapply(myDF$SPEND, myDF$YEAR, sum)
-----
-
-As a last example, we can calculate the amount spent on each day of purchases.
-We take the `sum` of all `SPEND` amounts, broken into groups according to the `PURCHASE_` day:
-
-[source,r]
-----
-tapply(myDF$SPEND, myDF$PURCHASE_, sum)
-----
-
-[source,r]
-----
-tail(sort( tapply(myDF$SPEND, myDF$PURCHASE_, sum) ),n=20)
-----
-
-It makes sense to sort the results and then look at the 20 days on which the `sum` of the `SPEND` amounts were the highest.
-
-++++
-
-++++
-
-[source,r]
-----
-tapply( mydata, mygroups, myfunction, na.rm=T )
-----
-
-Some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way:
-
-[source,r]
-----
-myfunction(mydata[mygroups == 1], na.rm=T)
-myfunction(mydata[mygroups == 2], na.rm=T)
-myfunction(mydata[mygroups == 3], na.rm=T) ....
-myfunction(mydata[mygroups == "IN"], na.rm=T)
-myfunction(mydata[mygroups == "OH"], na.rm=T)
-myfunction(mydata[mygroups == "IL"], na.rm=T) ....
-----
-
-
-[source,r]
-----
-myDF <- read.csv("/class/datamine/data/flights/subset/2005.csv")
-head(myDF)
-----
-
-`sum` all flight `Distance`, split into groups according to the airline (`UniqueCarrier`).
-
-[source,r]
-----
-sort(tapply(myDF$Distance, myDF$UniqueCarrier, sum))
-----
-
-Find the `mean` flight `Distance`, grouped according to the city of `Origin`.
-
-[source,r]
-----
-sort(tapply(myDF$Distance, myDF$Origin, mean))
-----
-
-Calculate the `mean` departure delay (`DepDelay`), for each airplane (i.e., each `TailNum`), using `na.rm=T` because some of the values of the departure delays are `NA`.
-
-[source,r]
-----
-tail(sort(tapply(myDF$DepDelay, myDF$TailNum, mean, na.rm=T)),n=20)
-----
-
-++++
-
-++++
-
-
-[source,r]
-----
-library(data.table)
-myDF <- fread("/class/datamine/data/election/itcont2016.txt", sep="|")
-head(myDF)
-----
-
-`sum` the amounts of all contributions made, grouped according to the `STATE` where the people lived.
-
-[source,r]
-----
-sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum))
-----
-
-`sum` the amounts of all contributions made, grouped according to the `CITY`/`STATE` where the people lived.
-
-[source,r]
-----
-tail(sort(tapply(myDF$TRANSACTION_AMT, paste(myDF$CITY, myDF$STATE), sum)),n=20)
-mylocations <- paste(myDF$CITY, myDF$STATE)
-tail(sort(tapply(myDF$TRANSACTION_AMT, mylocations, sum)),n=20)
-----
-
-`sum` the amounts of all contributions made, grouped according to the `EMPLOYER` where the people worked.
-
-[source,r]
-----
-tail(sort(tapply(myDF$TRANSACTION_AMT, myDF$EMPLOYER, sum)), n=30)
-----
-
-++++
-
-++++
-
-**Motivation:** `tapply` is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop.
-
-**Context:** The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called `tapply` from the "apply suite" of functions in R.
-
-**Scope:** r, for, tapply
-
-.Learning objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
-- Demonstrate how apply functions are generally faster than using loops.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/fars/7581.csv`
-
-
-== Questions
-
-[NOTE]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[NOTE]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to losing points.
-====
-
-=== Question 1
-
-The dataset, `/class/datamine/data/fars/7581.csv` contains the combined accident records from year 1975 to 1981. Load up the dataset into a data.frame named `dat`. In the previous project's question 4, we asked you to calculate the mean number of motorists involved in an accident (variable `PERSON`) with i drunk drivers where i takes the values from 0 through 6. This time, solve this question using the `tapply` function instead. Which method did you prefer and why?
-
-Now that you've read the data into a dataframe named `dat`, run the following code:
-
-[source,r]
-----
-# Read in data that maps state codes to state names
-state_names <- read.csv("/class/datamine/data/fars/states.csv")
-# Create a vector of state names called v
-v <- state_names$state
-# Set the names of the new vector to the codes
-names(v) <- state_names$code
-# Create a new column in the dat dataframe with the actual names of the states
-dat$mystates <- v[as.character(dat$STATE)]
-----
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The output/solution.
-====
-
-=== Question 2
-
-Make a state-by-state classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident?
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The entire output.
-- Which state has the highest average number of drunk drivers per accident?
-====
-
-=== Question 3
-
-Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match.
-
-[TIP]
-====
-Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown.
-====
-
-This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question.
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem.
-- What days have the highest number of fatalities?
-- What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents?
-====
-
-=== Question 4
-
-How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code.
-
-[TIP]
-====
-The `ALIGNMNT` variable is 1 for straight, 2 for curved, and 9 for unknown.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Results from running the R code.
-====
-
-=== Question 5
-
-Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.
-
-This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks.
-
-++++
-
-++++
-
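-A hedged sketch of the `cut` pattern (the column names `HOUR` and `FATALS` are assumptions based on the FARS documentation; values of 24 or more, such as the code for an unknown hour, fall outside the breaks and end up in the leftover "other" group as `NA`):
-
-[source,r]
------
-# bin the hour of the accident into four portions of the day
-portion <- cut(dat$HOUR, breaks=c(0, 6, 12, 18, 24), right=FALSE,
-               labels=c("midnight-6AM", "6AM-noon", "noon-6PM", "6PM-midnight"))
-
-# total fatalities during each portion of the day
-tapply(dat$FATALS, portion, sum)
-
-# average fatalities per crash during each portion of the day
-tapply(dat$FATALS, portion, mean)
------
-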
-.Items to submit
-====
-- R code used to solve the problem.
-- Results from running the R code.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project07.adoc
deleted file mode 100644
index b3d80d09a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project07.adoc
+++ /dev/null
@@ -1,166 +0,0 @@
-= STAT 19000: Project 7 -- Fall 2020
-
-**Motivation:** Three bread-and-butter functions that are part of base R are `subset`, `merge`, and `split`. `subset` provides a more natural way to filter and select data from a data.frame. `split` is a useful function that splits a dataset based on one or more factors. `merge` brings to R the principles of combining data that SQL uses.
-
-**Context:** We've been getting comfortable working with data within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data!
-
-**Scope:** r, subset, merge, split, tapply
-
-.Learning objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/goodreads/csv`
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to losing points.
-====
-
-=== Question 1
-
-Load up the following two datasets `goodreads_books.csv` and `goodreads_book_authors.csv` into the data.frames `books` and `authors`, respectively. How many columns and rows are in each of these two datasets?
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 2
-
-We want to figure out how book size (`num_pages`) is associated with various metrics. First, let's create a vector called `book_size`, that categorizes books into 4 categories based on `num_pages`: `small` (up to 250 pages), `medium` (250-500 pages), `large` (500-1000 pages), `huge` (1000+ pages).
-
-[NOTE]
-====
-This <<r-lapply-flight-example,video and code>> might be helpful.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of `table(book_size)`.
-====
-
-=== Question 3
-
-Use `tapply` to calculate the mean `average_rating`, `text_reviews_count`, and `publication_year` by `book_size`. Did any of the results surprise you? Why or why not?
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The output from running the R code.
-====
-
-=== Question 4
-
-Notice in (3) how we used `tapply` 3 times. This would get burdensome if we decided to calculate 4 or 5 or 6 columns instead. Instead of using tapply, we can use `split`, `lapply`, and `colMeans` to perform the same calculations.
-
-Use `split` to partition the data (keeping only the following 3 columns: `average_rating`, `text_reviews_count`, and `publication_year`) by `book_size`. Save the result as `books_by_size`. What class is the result? `lapply` is a function that allows you to loop over each item in a list and apply a function. Use `lapply` and `colMeans` to perform the same calculation as in (3).
-
-[NOTE]
-====
-This <<r-lapply-flight-example,video and code>> and also this <<r-lapply-fars-example,video and code>> might be helpful.
-====
-
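-Here is a generic sketch of the `split` + `lapply` + `colMeans` pattern, using the built-in `iris` data as a stand-in so that it does not give away the answer:
-
-[source,r]
------
-# split two numeric columns of iris into a list, with one data.frame per species
-parts <- split(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
-class(parts)   # a list
-
-# apply colMeans to each element of the list
-lapply(parts, colMeans)
------
-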
-.Items to submit
-====
-- R code used to solve the problem.
-- The output from running the code.
-====
-
-=== Question 5
-
-We are working with a lot more data than we really want right now. We've provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called `en_books`.
-
-[source,r]
-----
-en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]
-----
-
-Now create an equivalent data frame of your own, by using the `subset` function (instead of indexing). Use `res` as the name of the data frame that you create.
-Do the dimensions (using `dim`) of `en_books` and `res` agree? Why or why not? (They should both have 8 columns, but a different number of rows.)
-
-[TIP]
-====
-Since the dimensions don't match, take a look at NA values for the variables used to subset our data.
-====
-
-[NOTE]
-====
-This <<r-subset-8451-example,video and code>> and also this <<r-subset-election-example,video and code>> might be helpful.
-====
-
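-As a generic illustration of `subset` (again using `iris` as a stand-in), the row condition and the columns to keep can be specified in a single call:
-
-[source,r]
------
-# rows where the condition holds, keeping only three columns
-subset(iris, Species %in% c("setosa", "virginica") & Sepal.Length > 5,
-       select=c(Species, Sepal.Length, Petal.Width))
------
-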
-.Items to submit
-====
-- R code used to solve the problem.
-- Do the dimensions match?
-- 1-2 sentences explaining why or why not.
-====
-
-=== Question 6
-
-We now have a nice and tidy subset of data, called `res`. It would be really nice to get some information on the authors. We can find that information in the `authors` dataset loaded in question 1! In question 2 of the previous project, we had a similar issue with the state names. There is a *much* better and easier way to solve these types of problems. Use the `merge` function to combine `res` and `authors` in a way which appends all information from `authors` when there is a match in `res`. Use the condition `by="author_id"` in the merge. This is all you need to do:
-
-[source,r]
-----
-mymergedDF <- merge(res, authors, by="author_id")
-----
-
-[NOTE]
-====
-The resulting data frame will have all of the columns that are found in either `res` or `authors`. When we perform the merge, we only insist that the `author_id` should match. We do not expect that the `ratings_count` or `average_rating` should agree in `res` versus `authors`. Why? In the `res` data frame, the `ratings_count` and `average_rating` refer to the specific book, but in the `authors` data frame, the `ratings_count` and `average_rating` refer to the total works by the author. Therefore, in `mymergedDF`, there are columns `ratings_count.x` and `average_rating.x` from `res`, and there are columns `ratings_count.y` and `average_rating.y` from `authors`.
-====
-
-[NOTE]
-====
-Although we provided the necessary code for this example, you might want to know more about the merge function. This <<r-merge-fars-example,video and code>> and also this <<r-merge-flights-example,video and code>> might be helpful.
-====
-
-.Items to submit
-====
-- The given R code used to solve the problem.
-- The `dim` of the newly merged data.frame.
-====
-
-=== Question 7
-
-For an author of your choice (that _is_ in the dataset), find the author's highest rated book. Do you agree?
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The title of the highest rated book (from your author).
-- 1-2 sentences explaining why or why not you agree with it being the highest rated book from that author.
-====
-
-=== OPTIONAL QUESTION
-
-Look at the column names of the new dataframe created in question 6. Notice that there are two values for `ratings_count` and two values for `average_rating`. The names that have an appended `x` are those values from the first argument to `merge`, and the names that have an appended `y` are those values from the second argument to `merge`. Rename these columns to indicate whether they refer to a book or an author.
-
-[TIP]
-====
-For example, `ratings_count.x` could be `ratings_count_book` or `ratings_count_author`.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The `names` of the new data.frame.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project08.adoc
deleted file mode 100644
index 56d4c38a3..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project08.adoc
+++ /dev/null
@@ -1,127 +0,0 @@
-= STAT 19000: Project 8 -- Fall 2020
-
-**Motivation:** A key component of writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!
-
-**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
-
-**Scope:** r, functions
-
-.Learning objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-- Comprehend what a function is, and the components of a function in R.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/goodreads/csv`
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and should not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-=== Question 1
-
-Read in the same data, in the same way as the previous project (with the same names). We've provided you with the function below. How many arguments does the function have? Name all of the arguments. What is the name of the function? Replace the `description` column in our `books` data.frame with the same information, but with the punctuation stripped, using the function provided.
-
-[source,r]
-----
-# A function that, given a string (myColumn), returns the string
-# without any punctuation.
-strip_punctuation <- function(myColumn) {
- # Use regular expressions to identify punctuation.
- # Replace identified punctuation with an empty string ''.
- desc_no_punc <- gsub('[[:punct:]]+', '', myColumn)
-
- # Return the result
- return(desc_no_punc)
-}
-----
-
-[TIP]
-====
-Since `gsub` accepts a vector of values, you can pass an entire vector to `strip_punctuation`.
-====
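-
-For instance, a minimal sketch of how the provided function could be applied to the entire column (assuming `books` has already been read in, as in the previous project):
-
-[source,r]
-----
-# strip punctuation from every description at once,
-# and replace the original column with the cleaned version
-books$description <- strip_punctuation(books$description)
-----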
-
-.Items to submit
-====
-- R code used to solve the problem.
-- How many arguments does the function have?
-- What are the name(s) of all of the arguments?
-- What is the name of the function?
-====
-
-=== Question 2
-
-Use the `strsplit` function to split a string by spaces. Some examples would be:
-
-[source,r]
-----
-strsplit("This will split by space.", " ")
-strsplit("This. Will. Split. By. A. Period.", "\\.")
-----
-
-An example string is:
-
-[source,r]
-----
-test_string <- "This is a test string with no punctuation"
-----
-
-Test out `strsplit` using the provided `test_string`. Make sure to copy and paste the code that declares `test_string`. If you counted the words shown in your results, would it be an accurate count? Why or why not?
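-
-As a sketch of what "counting the words shown in your results" could look like (the exact counting approach is up to you):
-
-[source,r]
-----
-# strsplit returns a list, so unlist it before counting the pieces
-pieces <- unlist(strsplit(test_string, " "))
-length(pieces)
-----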
-
-**Relevant topics:** <<r-strsplit,strsplit>>, <<r-writing-functions,functions>>
-
-.Items to submit
-====
-- R code used to solve the problem.
-- 1-2 sentences explaining why your count would or would not be accurate.
-====
-
-=== Question 3
-
-Fix the issue in (2), using `which`. You may need to `unlist` the `strsplit` result first. After you've accomplished this, you can count the remaining words!
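-
-Here is a small, generic sketch of the idea on a toy vector (not the actual data), showing how `which` can be used to drop unwanted elements:
-
-[source,r]
-----
-v <- c("This", "", "is", "", "a", "test")
-
-# keep only the non-empty strings, then count them
-cleaned <- v[which(v != "")]
-length(cleaned)
-----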
-
-.Items to submit
-====
-- R code used to solve the problem (including counting the words).
-====
-
-=== Question 4
-
-We are finally to the point where we have code from questions (2) and (3) that we think we may want to use many times. Write a function called `count_words` which, given a string, `description`, returns the number of words in `description`. Test out `count_words` on the `description` from the second row of `books`. How many words are in the description?
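-
-One way to write it (essentially the same function that is provided again in Project 10) is sketched below; feel free to build it from your own code in questions (2) and (3) instead:
-
-[source,r]
-----
-count_words <- function(description) {
-  # split on spaces, drop empty strings, and count what remains
-  words <- unlist(strsplit(description, " "))
-  length(words[words != ""])
-}
-----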
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of using the function on the `description` from the second row of `books`.
-====
-
-=== Question 5
-
-Practice makes perfect! Write a function of your own design that is intended to be used with one of our datasets. Test it out and share the results.
-
-[NOTE]
-====
-You could even pass (as an argument) one of our datasets to your function and calculate a cool statistic or something like that! Maybe your function makes a plot? Who knows?
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- An example (with output) of using your newly created function.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project09.adoc
deleted file mode 100644
index 046f8d380..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project09.adoc
+++ /dev/null
@@ -1,167 +0,0 @@
-= STAT 19000: Project 9 -- Fall 2020
-
-
-**Motivation:** A key component of writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps over and over again. If you find you are repeating code over and over, a function may be a good way to reduce the number of lines of code!
-
-**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
-
-**Scope:** r, functions
-
-.Learning objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-- Comprehend what a function is, and the components of a function in R.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/goodreads/csv`
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and should not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-=== Question 1
-
-We've provided you with a function below. How many arguments does the function have, and what are their names? You can get a `book_id` from the URL of a goodreads book's webpage.
-
-Two examples:
-
-* If you search for the book `Words of Radiance` on goodreads, the `book_id` contained in the URL https://www.goodreads.com/book/show/17332218-words-of-radiance# is 17332218.
-* The URL https://www.goodreads.com/book/show/157993.The_Little_Prince?from_search=true&from_srp=true&qid=JJGqUK9Vp9&rank=1 (The Little Prince) has a `book_id` of 157993.
-
-Find 2 or 3 `book_id`s and test out the function until you get two successes. Explain, in words, what the function does and what options it gives you.
-
-[source,r]
-----
-library(imager)
-books <- read.csv("/class/datamine/data/goodreads/csv/goodreads_books.csv")
-authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")
-get_author_name <- function(my_authors_dataset, my_author_id){
- return(my_authors_dataset[my_authors_dataset$author_id==my_author_id,'name'])
-}
-fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) {
- book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,]
- all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,]
- author_name <- get_author_name(my_authors_dataset, book_info$author_id)
-
- img <- load.image(book_info$image_url)
-
- if(display_cover){
- par(mfrow=c(1,2))
- plot(img, axes=FALSE)
- }
-
- plot(all_books_by_author$num_pages, all_books_by_author$average_rating,
- ylim=c(0,5.1), pch=21, bg='grey80',
- xlab='Number of pages', ylab='Average rating',
- main=paste('Books by', author_name))
-
- points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5)
-}
-----
-
-.Items to submit
-====
-- How many arguments does the function have, and what are their names?
-- The result of using the function on 2-3 `book_id`s.
-- 1-2 sentences explaining what the function does (generally), and what (if any) options the function provides you with.
-====
-
-=== Question 2
-
-You may have encountered a situation where the `my_book_id` was not in our dataset, and hence, didn't get plotted. When writing functions, it is usually best to try and foresee issues like this and have the function fail gracefully, instead of showing some ugly (and sometimes unclear) warning. Add some code at the beginning of our function that checks to see if `my_book_id` is within our dataset, and if it does not exist, prints "Book ID not found.", and exits the function. Test it out on `book_id=123` and `book_id=19063`.
-
-[TIP]
-====
-Run `?stop` to see if that is a function that may be useful.
-====
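-
-A minimal sketch of the kind of check that could go at the top of `fun_plot` (using the same argument names as the provided function):
-
-[source,r]
-----
-if (!(my_book_id %in% my_books_dataset$book_id)) {
-  stop("Book ID not found.")
-}
-----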
-
-.Items to submit
-====
-- R code with your new and improved function.
-- The results from `fun_plot(authors, books, 123)`.
-- The results from `fun_plot(authors, books, 19063)`.
-====
-
-=== Question 3
-
-We have this nice `get_author_name` function that accepts a dataset (in this case, our `authors` dataset) and an `author_id`, and returns the name of the author. Write a new function called `get_author_id` that accepts the `authors` dataset and an author's name, and returns the `author_id` of that author.
-
-You can test your function using some of these examples:
-
-[source,r]
-----
-get_author_id(authors, "Brandon Sanderson") # 38550
-get_author_id(authors, "J.K. Rowling") # 1077326
-----
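-
-Since `get_author_id` is essentially the mirror image of `get_author_name`, a sketch might look like this:
-
-[source,r]
-----
-get_author_id <- function(my_authors_dataset, my_author_name) {
-  return(my_authors_dataset[my_authors_dataset$name == my_author_name, 'author_id'])
-}
-----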
-
-.Items to submit
-====
-- R code containing your new function.
-- The results of using your new function on a few authors.
-====
-
-=== Question 4
-
-See the function below.
-
-[source,r]
-----
-search_books_for_word <- function(word) {
- return(books[grepl(word, books$description, fixed=T),]$title)
-}
-----
-
-Given a word, `search_books_for_word` returns the titles of books where the provided word is inside the book's description. `search_books_for_word` utilizes the `books` dataset internally. It requires that the `books` dataset has been loaded into the environment prior to running (and with the correct name). By including and referencing objects defined _outside_ of our function's scope _within_ our function (in this case the variable `books`), our `search_books_for_word` function will be more prone to errors, as any changes to those objects may break our function. For example:
-
-[source,r]
-----
-our_function <- function(x) {
- print(paste("Our argument is:", x))
- print(paste("Our variable is:", my_variable))
-}
-# our variable outside the scope of our_function
-my_variable <- "dog"
-# run our_function
-our_function("first")
-# change the variable outside the scope of our function
-my_variable <- "cat"
-# run our_function again
-our_function("second")
-# imagine a scenario where "my_variable" doesn't exist, our_function would break!
-rm(my_variable)
-our_function("third")
-----
-
-Fix our `search_books_for_word` function to accept the `books` dataset as an argument called `my_books_dataset` and utilize `my_books_dataset` within the function instead of the global variable `books`.
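-
-A sketch of what the fixed function could look like:
-
-[source,r]
-----
-search_books_for_word <- function(my_books_dataset, word) {
-  # search the description column of whichever dataset was passed in
-  return(my_books_dataset[grepl(word, my_books_dataset$description, fixed=T),]$title)
-}
-----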
-
-.Items to submit
-====
-- R code with your new and improved function.
-- An example using the updated function.
-====
-
-=== Question 5
-
-Write your own custom function. Make sure your function includes at least 2 arguments. If you access one of our datasets from within your function (which you _definitely_ should do), use what you learned in (4), to avoid future errors dealing with scoping. Your function could output a cool plot, interesting tidbits of information, or anything else you can think of. Get creative and make a function that is fun to use!
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Examples using your function with included output.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project10.adoc
deleted file mode 100644
index de1aa08fa..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project10.adoc
+++ /dev/null
@@ -1,249 +0,0 @@
-= STAT 19000: Project 10 -- Fall 2020
-
-**Motivation:** Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming]. In this project, we will learn to _apply_ functions to entire vectors of data using `sapply`.
-
-**Context:** We've just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. `sapply` is one of the best ways to do this in R.
-
-**Scope:** r, sapply, functions
-
-.Learning objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/okcupid/filtered`
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and should not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-
-=== Question 1
-
-Load up the following datasets into data.frames named `users` and `questions`, respectively: `/class/datamine/data/okcupid/filtered/users.csv`, `/class/datamine/data/okcupid/filtered/questions.csv`. This is data from users on OkCupid, an online dating app. In your own words, explain what each file contains and how they are related -- it's _always_ a good idea to poke around the data to get a better understanding of how things are structured!
-
-[TIP]
-====
-Be careful: just because a file ends in `.csv` does _not_ mean it is comma-separated. You can change what separator `read.csv` uses with the `sep` argument. You can use the `readLines` function on a file (say, with `n=10`) to see the first lines of the file and determine the character to use with the `sep` argument.
-====
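-
-For example, a quick sketch of how to peek at the raw file and then read it with an explicit separator (the `;` shown here is only an assumption for illustration; use whatever character the `readLines` output actually shows):
-
-[source,r]
-----
-# look at the first few raw lines to figure out the separator
-readLines("/class/datamine/data/okcupid/filtered/users.csv", n=10)
-
-# then read with the separator you observed (";" is only an example)
-users <- read.csv("/class/datamine/data/okcupid/filtered/users.csv", sep=";")
-----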
-
-.Items to submit
-====
-- R code used to solve the problem.
-- 1-2 sentences describing what each file contains and how they are related.
-====
-
-=== Question 2
-
-`grep` is an incredibly powerful tool available to us in R. We will learn more about `grep` in the future, but for now, know that a simple application of `grep` is to find a word in a string. In R, `grep` is vectorized and can be applied to an entire vector of strings. Use `grep` to find a question that references "google". What is the question?
-
-[TIP]
-====
-If at first you don't succeed, run `?grep` and check out the `ignore.case` argument.
-====
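-
-A sketch of the idea (assuming the question text lives in a column called `text`, as referenced in the items to submit):
-
-[source,r]
-----
-# rows of questions whose text mentions "google", ignoring case
-questions[grep("google", questions$text, ignore.case=TRUE), ]
-----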
-
-[TIP]
-====
-To prepare for Question 3, look at the entire row of the `questions` data frame that has the question about google. The first entry on this row tells you the question that you need, in the `users` data frame, while working on Question 3.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The `text` of the question that references Google.
-====
-
-=== Question 3
-
-In (2) we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by `gender2`)? How about by `gender_orientation`?
-
-[TIP]
-====
-The two videos posted in Question 2 might help.
-====
-
-[TIP]
-====
-If you look at the column of `users` corresponding to the question identified in (2), you will see that this column of `users` has two possible answers, namely: `"No. Why spoil the mystery?"` and `"Yes. Knowledge is power!"`.
-====
-
-[TIP]
-====
-Use the `tapply` function with three inputs:
-
-- the correct column of `users`,
-- the variable used to break up the data (either `gender2` or `gender_orientation`), and
-- the following as your function:
-
-`function(x) {prop.table(table(x, useNA="always"))}`
-====
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the code.
-- Written answers to the questions.
-====
-
-=== Question 4
-
-In Project 8, we created a function called `count_words`. Use this function and `sapply` to create a vector which contains the number of words in each row of the column `text` from the `questions` dataframe. Call the new vector `question_length`, and add it as a column to the `questions` dataframe.
-
-[source,r]
-----
-count_words <- function(my_text) {
- my_split_text <- unlist(strsplit(my_text, " "))
-
- return(length(my_split_text[my_split_text!=""]))
-}
-----
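-
-A minimal sketch of the `sapply` step (assuming `count_words` has been defined as above):
-
-[source,r]
-----
-# as.character guards against the column having been read in as a factor
-question_length <- sapply(as.character(questions$text), count_words)
-questions$question_length <- question_length
-----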
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The result of `str(questions)` (this shows how your `questions` data frame looks, after adding the new column called `question_length`).
-====
-
-=== Question 5
-
-Consider this function called `number_of_options` that accepts a data frame (for instance, `questions`)...
-
-[source,r]
-----
-number_of_options <- function(myDF) {
- table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
-}
-----
-
-...and counts the number of questions that have each possible number of responses. For instance, if we calculate `number_of_options(questions)` we get:
-
-----
-  0   2   3   4
-590 936 519 746
-----
-
-which means that:
-
-- 590 questions have 0 possible responses;
-- 936 questions have 2 possible responses;
-- 519 questions have 3 possible responses; and
-- 746 questions have 4 possible responses.
-
-Now use the `split` function to break the data frame `questions` into 7 smaller data frames, according to the value in `questions$Keywords`. Then use the `sapply` function to determine, for each possible value of `questions$Keywords`, the analogous breakdown of questions with different numbers of responses, as we did above.
-
-[TIP]
-====
-You can write:
-
-[source,r]
-----
-mylist <- split(questions, questions$Keywords)
-sapply(mylist, number_of_options)
-----
-====
-
-The way `sapply` works is that the first argument (the vector or list) provides, by default, the first argument to your function; the second argument is the function you want applied; and after that you can specify additional arguments to your function by name. For example:
-
-[source,r]
-----
-test1 <- c(1, 2, 3, 4, NA, 5)
-test2 <- c(9, 8, 6, 5, 4, NA)
-mylist <- list(first=test1, second=test2)
-# for a single vector in the list
-mean(mylist$first, na.rm=T)
-# what if we want to do this for each vector in the list?
-# how do we remove na's?
-sapply(mylist, mean)
-# we can specify the arguments that are for the mean function
-# by naming them after the first two arguments, like this
-sapply(mylist, mean, na.rm=T)
-# in the code shown above, na.rm=T is passed to the mean function
-# just like if you run the following
-mean(mylist$first, na.rm=T)
-mean(mylist$second, na.rm=T)
-# you can include as many arguments to mean as you normally would
-# and in any order. just make sure to name the arguments
-sapply(mylist, mean, na.rm=T, trim=0.5)
-# or sapply(mylist, mean, trim=0.5, na.rm=T)
-# which is similar to
-mean(mylist$first, na.rm=T, trim=0.5)
-mean(mylist$second, na.rm=T, trim=0.5)
-----
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of the running the code.
-====
-
-=== Question 6
-
-_Lots_ of questions are asked in this `okcupid` dataset. Explore the dataset, and either calculate an interesting statistic/result using `sapply`, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered.
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results from running your code.
-- 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered.
-====
-
-=== OPTIONAL QUESTION
-
-Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called `percent_answered` that, given a vector, returns the percentage of values that are not NA. Use `percent_answered` and `sapply` to calculate the percentage of users who answered each question. Plot this result against the length of the questions.
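-
-A sketch of what `percent_answered` might look like:
-
-[source,r]
-----
-percent_answered <- function(v) {
-  # percentage of entries that are not NA
-  100 * sum(!is.na(v)) / length(v)
-}
-----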
-
-[TIP]
-====
-`length_of_questions <- questions$question_length[grep("^q", questions$X)]`
-====
-
-[TIP]
-====
-`grep("^q", questions$X)` returns the column index of every column that starts with "q". Use the same trick we used in the previous hint, to subset our `users` data.frame before using `sapply` to apply `percent_answered`.
-====
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The plot.
-- Whether or not you think there may or may not be an association between question length and whether or not the question is answered.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project11.adoc
deleted file mode 100644
index 431ef67be..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project11.adoc
+++ /dev/null
@@ -1,145 +0,0 @@
-= STAT 19000: Project 11 -- Fall 2020
-
-**Motivation:** The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you've learned so far this semester to solve data-driven problems. In previous projects, we've directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you'd like.
-
-**Context:** You've learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow.
-
-**Scope:** R
-
-.Learning objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Comprehend what a function is, and the components of a function in R.
-- Demonstrate the ability to use nested apply functions to solve a data-driven problem.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/zillow`
-
-== Questions
-
-=== Question 1
-
-Read `/class/datamine/data/zillow/Zip_time_series.csv` into a data.frame called `zipc`. Look at the `RegionName` column. It is supposed to be a 5-digit zip code. Either fix the column by writing a function and applying it to the column, or take the time to read the `read.csv` documentation by running `?read.csv` and use an argument to make sure that column is not read in as an integer (which is _why_ zip codes starting with `0` lose the leading `0` when being read in).
-
-[TIP]
-====
-This video demonstrates how to read in data and respect the leading zeroes.
-====
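-
-One possible sketch, using the `colClasses` argument of `read.csv` so that the zip codes are kept as character strings (and therefore keep their leading zeroes):
-
-[source,r]
-----
-zipc <- read.csv("/class/datamine/data/zillow/Zip_time_series.csv",
-                 colClasses=c(RegionName="character"))
-head(zipc$RegionName)
-----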
-
-.Items to submit
-====
-- R code used to solve the problem.
-- `head` of the `RegionName` column.
-====
-
-=== Question 2
-
-One might assume that the owner of a house tends to value that house more than the buyer does. If that were the case, perhaps the median listing price (the price at which the seller puts the house on the market, or asking price) would be higher than the ZHVI (Zillow Home Value Index -- essentially an estimate of the home value). For those rows where both `MedianListingPrice_AllHomes` and `ZHVI_AllHomes` have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be?
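-
-A sketch of one way to compute this (column names are taken from the question; the direction of the difference is up to you):
-
-[source,r]
-----
-# keep only rows where both values are present
-both <- zipc[!is.na(zipc$MedianListingPrice_AllHomes) & !is.na(zipc$ZHVI_AllHomes), ]
-
-# on average, how much higher (or lower) is the median listing price?
-mean(both$MedianListingPrice_AllHomes - both$ZHVI_AllHomes)
-----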
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result.
-====
-
-=== Question 3
-
-Convert the `Date` column to a date using `as.Date`. How many years of data do we have in this dataset? Create a line plot with lines for the average `MedianListingPrice_AllHomes` and average `ZHVI_AllHomes` by year. The result should be a single plot with multiple lines on it.
-
-[TIP]
-====
-Here we give two videos to help you with this question. The first video gives some examples about working with dates in R.
-====
-
-[TIP]
-====
-This second video gives an example about how to plot two line graphs at the same time in R.
-====
-
-[TIP]
-====
-For a nice addition, add a dotted vertical line at year 2008, near the housing crisis:
-
-[source,r]
-----
-abline(v="2008", lty="dotted")
-----
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The results of running the code.
-====
-
-=== Question 4
-
-Read `/class/datamine/data/zillow/State_time_series.csv` into a data.frame called `states`. Calculate the average median listing price by state, and create a map using `plot_usmap` from the `usmap` package that shows the average median price by state.
-
-[TIP]
-====
-We give a full example about how to plot values, by State, on a map.
-====
-
-[TIP]
-====
-In order for `plot_usmap` to work, you must name the column containing the states' names "state".
-====
-
-[TIP]
-====
-To split words like "OhSoCool" into "Oh So Cool", try this: `trimws(gsub('([[:upper:]])', ' \\1', "OhSoCool"))`. This will be useful as you'll need to correct the `RegionName` column at some point in time. Notice that this will not completely fix "DistrictofColumbia". You will need to fix that one manually.
-====
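-
-A rough sketch of the overall flow (just a sketch; it assumes the `usmap` package is available, and the aggregation could also be done with `tapply`):
-
-[source,r]
-----
-library(usmap)
-
-# average median listing price by state (NA values are dropped by aggregate)
-avg_by_state <- aggregate(MedianListingPrice_AllHomes ~ RegionName, data=states, FUN=mean)
-
-# split the CamelCase state names, e.g. "NewYork" -> "New York"
-# (remember "DistrictofColumbia" still needs to be fixed manually)
-avg_by_state$state <- trimws(gsub('([[:upper:]])', ' \\1', avg_by_state$RegionName))
-
-plot_usmap(data=avg_by_state, values="MedianListingPrice_AllHomes")
-----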
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The resulting map.
-====
-
-=== Question 5
-
-Read `/class/datamine/data/zillow/County_time_series.csv` into a data.frame named `counties`. Choose a state (or states) that you would like to "dig down" into county-level data for, and create a plot (or plots) like in (4) that show some interesting statistic by county. You can choose average median listing price if you so desire, however, you don't need to! There are other cool data!
-
-[TIP]
-====
-Make sure that you remember to aggregate your data by `RegionName` so the plot renders correctly.
-====
-
-[TIP]
-====
-`plot_usmap` looks for a column named `fips`. Make sure to rename the `RegionName` column to `fips` prior to passing the data.frame to `plot_usmap`.
-====
-
-[TIP]
-====
-If you get Question 4 working correctly, here are the main differences for Question 5: you need `regions` to be `"counties"` instead of `"states"`, and you need the `data.frame` to have a column called `fips` instead of `state`.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The resulting map.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project12.adoc
deleted file mode 100644
index 91bdd03ce..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project12.adoc
+++ /dev/null
@@ -1,160 +0,0 @@
-= STAT 19000: Project 12 -- Fall 2020
-
-**Motivation:** In the previous project you were forced to do a little bit of date manipulation. Dates _can_ be very difficult to work with, regardless of the language you are using. `lubridate` is a package within the famous https://www.tidyverse.org/[tidyverse], that greatly simplifies some of the most common tasks one needs to perform with date data.
-
-**Context:** We've been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a https://www.tidyverse.org/[tidyverse] package that adds great stand-alone value when working with dates.
-
-**Scope:** r
-
-.Learning objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Incorporate legends using legend().
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-- Convert strings to dates, and format dates using the lubridate package.
-****
-
-== Questions
-
-=== Question 1
-
-Let's continue our exploration of the Zillow time series data. A useful package for dealing with dates is called `lubridate`. This is part of the famous https://www.tidyverse.org/[tidyverse] suite of packages. Run the code below to load it. Read the `/class/datamine/data/zillow/State_time_series.csv` dataset into a data.frame named `states`. What class and type is the column `Date`?
-
-[source,r]
-----
-library(lubridate)
-----
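-
-A sketch of the kind of check the question is asking for (the path is the same one given above):
-
-[source,r]
-----
-states <- read.csv("/class/datamine/data/zillow/State_time_series.csv")
-class(states$Date)
-typeof(states$Date)
-----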
-
-.Items to submit
-====
-- R code used to solve the question.
-- The `class` and `typeof` of the column `Date`.
-====
-
-=== Question 2
-
-Convert the column `Date` to a proper date format using `lubridate`. Check that you correctly transformed it by checking its class, like we did in question (1). Compare and contrast this method of conversion with the solution you came up with for question (3) in the previous project. Which method do you prefer?
-
-[TIP]
-====
-Take a look at the following functions from `lubridate`: `ymd`, `mdy`, `dym`.
-====
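-
-For example, if the Zillow dates are stored in year-month-day order (check a few values first), `ymd` would be the relevant helper -- a sketch:
-
-[source,r]
-----
-states$Date <- ymd(states$Date)
-class(states$Date)   # should now be "Date"
-----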
-
-[TIP]
-====
-Here is a video about `ymd`, `mdy`, `dym`
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- The `class` of the modified column `Date`.
-- 1-2 sentences stating which method you prefer (if any) and why.
-====
-
-=== Question 3
-
-Create 3 new columns in `states` called `year`, `month`, `day_of_week` (Sun-Sat) using `lubridate`. Get the frequency table for your newly created columns. Do we have the same amount of data for all years, for all months, and for all days of the week? We did something similar in question (3) in the previous project -- specifically, we broke each date down by year. Which method do you prefer and why?
-
-[TIP]
-====
-Take a look at functions `month`, `year`, `day`, `wday`.
-====
-
-[TIP]
-====
-You may find the argument of `label` in `wday` useful.
-====
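-
-A sketch of the column creation (assuming `Date` has already been converted as in question (2)):
-
-[source,r]
-----
-states$year        <- year(states$Date)
-states$month       <- month(states$Date, label=TRUE)
-states$day_of_week <- wday(states$Date, label=TRUE)
-
-table(states$year)
-table(states$month)
-table(states$day_of_week)
-----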
-
-[TIP]
-====
-Here is a video about `month`, `year`, `day`, `wday`
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- Frequency table for newly created columns.
-- 1-2 sentences answering whether or not we have the same amount of data for all years, months, and days of the week.
-- 1-2 sentences stating which method you prefer (if any) and why.
-====
-
-=== Question 4
-
-Is there a better month or set of months to put your house on the market? Use `tapply` to compare the average `DaysOnZillow_AllHomes` for all months. Make a barplot showing your results. Make sure your barplot includes "all of the fixings" (title, labeled axes, legend if necessary, etc.), and make it look good.
-
-[TIP]
-====
-If you want to have the month's abbreviation in your plot, you may find both the `month.abb` object and the argument `names.arg` in `barplot` useful.
-====
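-
-A sketch of the core computation (the plot "fixings" are left to you; it assumes all 12 months appear in the data):
-
-[source,r]
-----
-avg_days <- tapply(states$DaysOnZillow_AllHomes, month(states$Date), mean, na.rm=TRUE)
-
-barplot(avg_days, names.arg=month.abb,
-        xlab="Month", ylab="Average days on Zillow",
-        main="Average DaysOnZillow_AllHomes by month")
-----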
-
-[TIP]
-====
-This video might help with Question 4.
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- The barplot of the average `DaysOnZillow_AllHomes` for all months.
-- 1-2 sentences answering the question "Is there a better time to put your house on the market?" based on your results.
-====
-
-=== Question 5
-
-Filter the `states` data to contain only years from 2010+ and call it `states2010plus`. Make a lineplot showing the average `DaysOnZillow_AllHomes` by `Date` using `states2010plus` data. Can you spot any trends? Write 1-2 sentences explaining what (if any) trends you see.
-
-.Items to submit
-====
-- R code used to solve the question.
-- The time series lineplot for the average `DaysOnZillow_AllHomes` per date.
-- 1-2 sentences commenting on the patterns found in the plot, and your impressions of it.
-====
-
-=== Question 6
-
-Do homes sell faster in certain states? For the following states: 'California', 'Indiana', 'NewYork' and 'Florida', make a lineplot for `DaysOnZillow_AllHomes` by `Date` with one line per state. Use the `states2010plus` dataset for this question. Make sure to have each state line colored differently, and to add a legend to your plot. Examine the plot and write 1-2 sentences about any observations you have.
-
-[TIP]
-====
-You may want to use the `lines` function to add the lines for the different states.
-====
-
-[TIP]
-====
-Make sure to fix the y-axis limits using the `ylim` argument in `plot` to properly show all four lines.
-====
-
-[TIP]
-====
-You may find the argument `col` useful to change the color of your line.
-====
-
-[TIP]
-====
-To make your legend fit, consider using the states abbreviation, and the arguments `ncol` and `cex` of the `legend` function.
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- The time series lineplot for `DaysOnZillow_AllHomes` per date for the 4 states.
-- 1-2 sentences commenting on the patterns found in the plot, and your answer to the question "Do homes sell faster in certain states rather than others?".
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project13.adoc
deleted file mode 100644
index 2730d53bf..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project13.adoc
+++ /dev/null
@@ -1,198 +0,0 @@
-= STAT 19000: Project 13 -- Fall 2020
-
-**Motivation:** It's important to be able to lookup and understand the documentation of a new function. You may have looked up the documentation of functions like `paste0` or `sapply`, and noticed that in the "usage" section, one of the arguments is an ellipsis (`...`). Well, unless you understand what this does, it's hard to really _get_ it. In this project, we will experiment with ellipsis, and write our own function that utilizes one.
-
-**Context:** We've learned about, used, and written functions in many projects this semester. In this project, we will utilize some of the less-known features of functions.
-
-**Scope:** r, functions
-
-.Learning objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Incorporate legends using legend().
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-- Convert strings to dates, and format dates using the lubridate package.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/beer/`
-
-== Questions
-
-=== Question 1
-
-Read `/class/datamine/data/beer/beers.csv` into a data.frame named `beers`. Read `/class/datamine/data/beer/breweries.csv` into a data.frame named `breweries`. Read `/class/datamine/data/beer/reviews.csv` into a data.frame named `reviews`.
-
-[TIP]
-====
-Notice that `reviews.csv` is a _large_ file. Luckily, you can use a function from the famous `data.table` package called `fread`. The function `fread` is _much_ faster at reading large files compared to `read.csv`. It reads the data into a class called `data.table`. We will learn more about this later on. For now, use `fread` to read in the `reviews.csv` data, then convert it from the `data.table` class into a `data.frame` by wrapping the result of `fread` in the `data.frame` function.
-====
-
-[TIP]
-====
-Do not forget to load the `data.table` library before attempting to use the `fread` function.
-====
-
-Below we show you an example of how fast the `fread` function is compared to `read.csv`.
-
-[source,r]
-----
-library(data.table)
-library(microbenchmark)
-
-microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000),
-               data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)),
-               times=5)
-----
-
-----
-Unit: milliseconds
-                                                                      expr
-         read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05)
-data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05))
-       min        lq      mean    median        uq       max neval
- 5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589     5
-  120.7705  122.3812  127.9842  128.7794  133.7695  134.2205     5
-----
-
-[TIP]
-====
-This video demonstrates how to read the `reviews` data using `fread`.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem.
-====
-
-=== Question 2
-
-Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in `breweries` don't have an associated beer in `beers`? How many beers in `beers` don't have an associated brewery in `breweries`?
-
-[TIP]
-====
-We compare lists of names using `sum` or `intersect`. Similar techniques can be used for Question 2.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- A description of columns which connect each of the files.
-- How many breweries don't have an associated beer in `beers`.
-- How many beers don't have an associated brewery in `breweries`.
-====
-
-=== Question 3
-
-Run `?sapply` and look at the usage section for `sapply`. If you look at the description for the `...` argument, you'll see it is "optional arguments to `FUN`". What this means is that you can specify additional input for the function you are passing to `sapply`. One example would be passing `T` to `na.rm` in the mean function: `sapply(dat, mean, na.rm=T)`. Use `sapply` and the `strsplit` function to separate the types of breweries (`types`) by commas. Use another `sapply` to loop through your results and count the number of types for each brewery. Be sure to name your final results `n_types`. What is the average number of services (`n_types`) that breweries in IN and MI offer (we are looking for the average of IN and MI _combined_)? Does that surprise you?
-
-[NOTE]
-====
-When you have one `sapply` inside of another, or one loop inside of another, or an if/else statement inside of another, this is commonly referred to as nesting. So when Googling, you can type "nested sapply" or "nested if statements", etc.
-====
-
-[TIP]
-====
-We show, in this video, how to find the average number of parts in a midwesterner's name. Perhaps surprisingly, this same technique will be useful in solving Question 3.
-====
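-
-A sketch of the nested `sapply` idea (just a sketch; it assumes the comma-separated services are in `breweries$types` and that the brewery's state is in a column called `state`):
-
-[source,r]
-----
-# split each brewery's comma-separated list of services
-# (as.character guards against the column having been read in as a factor)
-split_types <- sapply(as.character(breweries$types), strsplit, split=",")
-
-# count how many services each brewery offers
-n_types <- sapply(split_types, length)
-
-# average number of services for breweries in IN and MI combined
-mean(n_types[breweries$state %in% c("IN", "MI")])
-----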
-
-.Items to submit
-====
-- R code used to solve the question.
-- 1-2 sentences answering the average amount of services breweries in Indiana and Michigan offer, and commenting on this answer.
-====
-
-=== Question 4
-
-Write a function called `compare_beers` that accepts the `reviews` data frame, a function that you will call `FUN`, and any number of vectors of beer ids. The function `compare_beers` should cycle through each vector/group of `beer_id`s, compute the function `FUN` on the review scores from the corresponding subset of `reviews`, and print "Group X: some_score", where X is the group number (1, 2, ...) and some_score is the result of applying `FUN` to that subset of the `reviews` data.
-
-In the example below, the function `FUN` is the `median` function, and we have two vectors/groups of `beer_id`s passed, with `c(271781)` being group 1 and `c(125646, 82352)` being group 2. Note that even though our example only passes two vectors to our `compare_beers` function, we want to write the function in a way that we could pass as many vectors as we want to.
-
-Example:
-
-[source,r]
-----
-compare_beers(reviews, median, c(271781), c(125646, 82352))
-----
-
-This example gives the output:
-----
-Group 1: 4
-Group 2: 4.56
-----
-
-For your solution to this question, show the behavior of `compare_beers` in this example:
-[source,r]
-----
-compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355))
-----
-
-[TIP]
-====
-There are different approaches to this question. You can use for loops or `sapply`. It will probably help to start small and build slowly toward the solution.
-====
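-
-A sketch of how the ellipsis can be used here (just a sketch; it assumes the review scores are in a column called `score` and the beer ids in a column called `beer_id`, so adjust the column names to match the actual data):
-
-[source,r]
-----
-compare_beers <- function(my_reviews, FUN, ...) {
-  groups <- list(...)   # gather all of the beer_id vectors that were passed in
-
-  for (i in seq_along(groups)) {
-    group_scores <- my_reviews$score[my_reviews$beer_id %in% groups[[i]]]
-    print(paste0("Group ", i, ": ", FUN(group_scores)))
-  }
-}
-----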
-
-[TIP]
-====
-This first video shows how to use `...` in defining a function.
-====
-
-[TIP]
-====
-This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result from running the provided example.
-====
-
-=== Question 5
-
-Beer wars! IN and MI against AZ and CO. Use the function you wrote in question (4) to compare the `beer_id`s from each group of states. Make a cool plot of some sort. Be sure to comment on your plot.
-
-[TIP]
-====
-Create a vector of `beer_ids` per group before passing it to your function from (4).
-====
-
-[TIP]
-====
-This video demonstrates an example of how to use the `compare_beers` function.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result from running your function.
-- The resulting plot.
-- 1-2 sentence commenting on your plot.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project14.adoc
deleted file mode 100644
index 91f5e5a79..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project14.adoc
+++ /dev/null
@@ -1,131 +0,0 @@
-= STAT 19000: Project 14 -- Fall 2020
-
-**Motivation:** Functions are the building blocks of more complex programming. It's vital that you understand how to read and write functions. In this project we will incrementally build and improve upon a function designed to recommend a beer. Note that you will not be winning any awards for this recommendation system, it is just for fun!
-
-**Context:** One of the main focuses throughout the semester has been on functions, and for good reason. In this project we will continue to exercise our R skills and build up our recommender function.
-
-**Scope:** r, functions
-
-.Learning objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/beer/`
-
-== Questions
-
-=== Question 1
-
-Read `/class/datamine/data/beer/beers.csv` into a data.frame named `beers`. Read `/class/datamine/data/beer/breweries.csv` into a data.frame named `breweries`. Read `/class/datamine/data/beer/reviews.csv` into a data.frame named `reviews`. As in the previous project, make sure you used the `fread` function from the `data.table` package, and convert the `data.table` to a `data.frame`. We want to create a very basic beer recommender. We will start simple. Create a function called `recommend_a_beer` that takes as input `my_beer_id` (a single value) and returns a vector of `beer_ids` from the same `style`. Test your function on `2093`.
-
-[TIP]
-====
-Make sure you do not include the given `my_beer_id` in the vector of recommended `beer_ids`.
-====
-
-[TIP]
-====
-You may find the function `setdiff` useful. Run the example below to get an idea of what it does.
-====
-
-[NOTE]
-====
-You will not win any awards for this recommendation system!
-====
-
-[source,r]
-----
-x <- c('a','b','b','c')
-y <- c('c','b','d','e','f')
-setdiff(x,y)
-setdiff(y,x)
-----
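-
-A sketch of what this first version of the recommender might look like (assuming the `beers` data frame has columns `beer_id` and `style`, as the question suggests):
-
-[source,r]
-----
-recommend_a_beer <- function(my_beer_id) {
-  # the style of the given beer
-  my_style <- beers$style[beers$beer_id == my_beer_id]
-
-  # all beers of the same style, minus the beer we started with
-  same_style_ids <- beers$beer_id[beers$style == my_style]
-  setdiff(same_style_ids, my_beer_id)
-}
-----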
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Length of result from `recommend_a_beer(2093)`.
-- The result of `2093 %in% recommend_a_beer(2093)`.
-====
-
-=== Question 2
-
-That is a lot of beer recommendations! Let's try to narrow it down. Include an argument in your function called `min_score` with default value of 4.5. Our recommender will only recommend `beer_ids` with a review score of at least `min_score`. Test your improved beer recommender with the same `beer_id` from question (1).
-
-[TIP]
-====
-Note that now we need to look at both `beers` and `reviews` datasets.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Length of result from `recommend_a_beer(2093)`.
-====
-
-=== Question 3
-
-There is still room for improvement (obviously) for our beer recommender. Include a new argument in your function called `same_brewery_only` with default value `FALSE`. This argument will determine whether or not our beer recommender will return only beers from the same brewery. Test our newly improved beer recommender with the same `beer_id` from question (1) with the argument `same_brewery_only` set to `TRUE`.
-
-[TIP]
-====
-You may find the function `intersect` useful. Run the example below to get an idea of what it does.
-
-[source,r]
-----
-x <- c('a','b','b','c')
-y <- c('c','b','d','e','f')
-intersect(x,y)
-intersect(y,x)
-----
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Length of result from `recommend_a_beer(2093, same_brewery_only=TRUE)`.
-====
-
-=== Question 4
-
-Oops! Bad idea! Maybe including only beers from the same brewery is not the best idea. Add an argument to our beer recommender named `type`. If `type="style"`, our recommender will recommend beers based on the `style`, as we did in question (3). If `type="reviewers"`, our recommender will recommend beers based on reviewers with "similar taste". Select reviewers that gave a score equal to or greater than `min_score` for the given beer id (`my_beer_id`). For those reviewers, find the `beer_ids` of other beers that these reviewers have given a score of at least `min_score`. These `beer_ids` are the ones our recommender will return. Be sure to test our improved recommender on the same `beer_id` as in (1)-(3).
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Length of result from `recommend_a_beer(2093, type="reviewers")`.
-====
-
-=== Question 5
-
-Let's try to narrow down the recommendations. Include an argument called `abv_range` that indicates the abv range we would like the recommended beers to be in. Set the default value of `abv_range` to `NULL`, so that if a user does not specify the `abv_range`, our recommender does not consider it. Test our recommender for `beer_id` 2093, with `abv_range = c(8.9,9.1)` and `min_score=4.9`.
-
-[TIP]
-====
-You may find the function `is.null` useful.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Length of result from `recommend_a_beer(2093, abv_range=c(8.9, 9.1), type="reviewers", min_score=4.9)`.
-====
-
-=== Question 6
-
-Play with our `recommend_a_beer` function and add another feature to it. Some ideas are: putting a limit on the number of `beer_id`s we will return, error catching (what if we don't have reviews for a given `beer_id`?), adding a plot to the output, returning beer names instead of ids, or new arguments to decide which `beer_id`s to recommend. Be creative and have fun!
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result from running the improved `recommend_a_beer` function showcasing your improvements to it.
-- 1-2 sentences commenting on what you decided to include and why.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project15.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project15.adoc
deleted file mode 100644
index 01008df94..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project15.adoc
+++ /dev/null
@@ -1,130 +0,0 @@
-= STAT 19000: Project 15 -- Fall 2020
-
-**Motivation:** Some people say it takes 20 hours to learn a skill, some say 10,000 hours. What is certain is it definitely takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester.
-
-**Context:** This is the final project of the semester. We sincerely hope that you've learned something, and that we've provided you with first-hand experience digging through data.
-
-**Scope:** r
-
-.Learning objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/donorschoose/`
-
-== Questions
-
-=== Question 1
-
-Read the data `/class/datamine/data/donorschoose/Projects.csv` into a data.frame called `projects`. Make sure you use the function you learned in Project 13 (`fread`) from the `data.table` package to read the data. Don't forget to then convert the `data.table` into a `data.frame`. Let's do an initial exploration of this data. What types of projects (`Project.Type`) are there? How many resource categories (`Project.Resource.Category`) are there?
-
-.Items to submit
-====
-- R code used to solve the question.
-- 1-2 sentences containing the project's types and how many resource categories are in the dataset.
-====
-
-=== Question 2
-
-Create two new variables in `projects`: the number of days a project lasted, and the number of days until the project was fully funded. Name these variables `project_duration` and `time_until_funded`, respectively. To calculate them, use the project's posted date (`Project.Posted.Date`), expiration date (`Project.Expiration.Date`), and fully funded date (`Project.Fully.Funded.Date`). What are the shortest and longest times until a project is fully funded? As a consistency check, see if we have any negative project durations. If so, how many?
-
-[TIP]
-====
-You _may_ find the argument `units` in the function `difftime` useful.
-====
-
-[TIP]
-====
-Be sure to pay attention to the order of operations of `difftime`.
-====
-
-[TIP]
-====
-Note that if you used the `fread` function from `data.table` to read in the data, you will not need to convert the columns as date.
-====
-
-[TIP]
-====
-It is _not_ required that you use `difftime`.
-====
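-
-A sketch of the idea with `difftime` (column names are from the question; it assumes the data was read with `fread`, so the date columns are already parsed as dates):
-
-[source,r]
-----
-projects$project_duration <- as.numeric(difftime(projects$Project.Expiration.Date,
-                                                 projects$Project.Posted.Date,
-                                                 units="days"))
-
-projects$time_until_funded <- as.numeric(difftime(projects$Project.Fully.Funded.Date,
-                                                  projects$Project.Posted.Date,
-                                                  units="days"))
-
-# consistency check: how many negative durations?
-sum(projects$project_duration < 0, na.rm=TRUE)
-----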
-
-.Items to submit
-====
-- R code used to solve the question.
-- Shortest and longest times until a project is fully funded.
-- 1-2 sentences answering whether we have any negative project durations, and if so, how many.
-====
-
-=== Question 3
-
-As you noted in question (2), there may be some projects with negative duration. Since we may have some concerns about the data for these projects, filter the `projects` data to exclude the projects with negative duration, and call this filtered data `selected_projects`. With that filtered data, make a `dotchart` of the mean time until the project is fully funded (`time_until_funded`) for the various resource categories (`Project.Resource.Category`). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one.
-
-[TIP]
-====
-You will first need to average time until fully funded for the different categories before making your plot.
-====
-
-[TIP]
-====
-To make your `dotchart` look nicer, you may want to first order the average time until fully funded before passing it to the `dotchart` function. In addition, consider reducing the y-axis font size using the argument `cex`.
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- Resulting dotchart.
-- 1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don't forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples.
-====
-
-=== Question 4
-
-Read `/class/datamine/data/donorschoose/Schools.csv` into a data.frame called `schools`. Combine `selected_projects` and `schools` by `School.ID` keeping only `School.ID`s present in both datasets. Name the combined data.frame `selected_projects`. Use the newly combined data to determine the percentage of already fully funded projects (`Project.Current.Status`) for schools in West Lafayette, IN. In addition, determine the state (`School.State`) with the highest number of projects. Be sure to specify the number of projects this state has.
-
-[TIP]
-====
-West Lafayette, IN zip codes are 47906 and 47907.
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- 1-2 sentences answering the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has.
-====
-
-=== Question 5
-
-Using the combined `selected_projects` data, get the school(s) (`School.Name`), city/cities (`School.City`) and state(s) (`School.State`) for the teacher with the highest percentage of fully funded projects (`Project.Current.Status`).
-
-[TIP]
-====
-There are many ways to solve this problem. For example, one option to get the teacher's ID is to create a variable indicating whether or not the project is fully funded and use `tapply`. Another option is to create `prop.table` and select the corresponding column/row.
-====
-
-[TIP]
-====
-Note that each row in the data corresponds to a unique project ID.
-====
-
-[TIP]
-====
-Once you have the teacher's ID, consider filtering `projects` to contain only rows for which the corresponding teacher's ID is in, and only the columns we are interested in: `School.Name`, `School.City`, and `School.State`. Then, you can get the unique values in this shortened data.
-====
-
-[TIP]
-====
-To get only certain columns when subsetting, you may find the argument `select` from `subset` useful.
-====
-
-.Items to submit
-====
-- R code used to solve the question.
-- Output of your code containing school(s), city(s) and state(s) of the selected teacher.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project01.adoc
deleted file mode 100644
index 85637e788..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project01.adoc
+++ /dev/null
@@ -1,167 +0,0 @@
-= STAT 29000: Project 1 -- Fall 2020
-
-**Motivation:** In this project we will jump right into an R review. In this project we are going to break one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.
-
-**Context:** We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned.
-
-**Scope:** data wrangling in R, functions
-
-.Learning objectives
-****
-- Comprehend what a function is, and the components of a function in R.
-- Read and write basic (csv) data.
-- Utilize apply functions in order to solve a data-driven problem.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and read the important information about project submissions xref:submissions.adoc[here].
-
-You can find useful examples that walk you through relevant material in The Examples Book:
-
-https://thedatamine.github.io/the-examples-book
-
-It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
-
-[IMPORTANT]
-====
-It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials.
-====
-
-We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.
-
-Remember the very useful documentation shortcut `?`. To use, simply type `?` in the console, followed by the name of the function you are interested in.
-
-You can also look for package documentation by using `help(package=PACKAGENAME)`, so for example, to see the documentation for the package `ggplot2`, we could run:
-
-[source,r]
-----
-help(package=ggplot2)
-----
-
-Sometimes it can be helpful to see the source code of a defined function. A https://www.tutorialspoint.com/r/r_functions.htm[function] is any chunk of organized code that is used to perform an operation. Source code is the underlying `R` or `C` or `C++` code that is used to create the function. To see the source code of a defined function, type the function's name without the `()`. For example, if we were curious about what the function `Reduce` does, we could run:
-
-[source,r]
-----
-Reduce
-----
-
-Occasionally this will be less useful, as the resulting code will be code that calls `C` code we can't see. Other times it will allow you to understand the function better.
-
-== Dataset(s)
-
-`/class/datamine/data/airbnb`
-
-Oftentimes (maybe even the majority of the time) data doesn't come in one nice file or database. Explore the datasets in `/class/datamine/data/airbnb`.
-
-== Questions
-
-=== Question 1
-
-You may have noted that, for each country, city, and date we can find 3 files: `calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz` (for now, we will ignore all files in the "visualisations" folders).
-
-Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (`calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz`). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
-
-[TIP]
-====
-`read.csv` has an argument to select the number of rows we want to read.
-====
-
-[TIP]
-====
-Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown, so you do not need to display the first 50 rows of the listings and/or reviews in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.
-====
-
-.Items to submit
-====
-- Chunk of code used to read the first 50 rows of each dataset.
-- 1-2 sentences briefly describing the information contained in each dataset.
-- Name(s) of variable(s) that could be used to join them.
-====
-
-To read a compressed csv, simply use the `read.csv` function:
-
-[source,r]
-----
-dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
-head(dat)
-----
-
-Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the `listings.csv.gz` datasets.
-
-=== Question 2
-
-Write a function called `get_paths_for_country`, that, given a string with the country name, returns a vector with the full paths for all `listings.csv.gz` files, starting with `/class/datamine/data/airbnb/...`.
-
-For example, the output from `get_paths_for_country("united-states")` should have 28 entries. Here are the first 5 entries in the output:
-
-----
- [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
- [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
- [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
- [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
- [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
-----
-
-[TIP]
-====
-`list.files` is useful with the `recursive=T` option.
-====
-
-[TIP]
-====
-Use `grep` to search for the pattern `listings.csv.gz` (within the results from the first hint), and use the option `value=T` to display the values found by the `grep` function.
-====
-
-.Items to submit
-====
-- Chunk of code for your `get_paths_for_country` function.
-====
-
-=== Question 3
-
-Write a function called `get_data_for_country` that, given a string with the country name, returns a data.frame containing all the listings data for that country. Use your previously written function to help you.
-
-[TIP]
-====
-Use `stringsAsFactors=F` in the `read.csv` function.
-====
-
-[TIP]
-====
-Use `do.call(rbind, )` to combine a list of dataframes into a single dataframe.
-====
-
-.Items to submit
-====
-- Chunk of code for your `get_data_for_country` function.
-====
-
-=== Question 4
-
-Use your `get_data_for_country` to get the data for a country of your choice, and make sure to name the data.frame `listings`. Take a look at the following columns: `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, and `is_location_exact`. What is the data type for each column? (You can use `class` or `typeof` or `str` to see the data type.)
-
-These columns would make more sense as logical values (TRUE/FALSE/NA).
-
-Write a function called `transform_column` that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (`""`), and we need to be careful when transforming the data. Test your function on the column `host_is_superhost`.
-
-.Items to submit
-====
-- Chunk of code for your `transform_column` function.
-- Type of `transform_column(listings$host_is_superhost)`.
-====
-
-=== Question 5
-
-Apply your function `transform_column` to the columns `instant_bookable` and `is_location_exact` in your `listings` data.
-
-Based on your `listings` data, if you are looking at an instant bookable listing (where `instant_bookable` is `TRUE`), would you expect the location to be exact (where `is_location_exact` is `TRUE`)? Why or why not?
-
-[TIP]
-====
-Make a frequency table, and see how many instant bookable listings have exact location.
-====
-
-.Items to submit
-====
-- Chunk of code to get a frequency table.
-- 1-2 sentences explaining whether or not we would expect the location to be exact if we were looking at an instant bookable listing.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project02.adoc
deleted file mode 100644
index 8000feb06..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project02.adoc
+++ /dev/null
@@ -1,167 +0,0 @@
-= STAT 29000: Project 2 -- Fall 2020
-
-**Motivation:** The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124["Why Most Published Research Findings Are False"] by John Ioannidis (2005).
-
-**Context:** Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates.
-
-**Scope:** Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible.
-
-.Learning objectives
-****
-- Use Markdown syntax within an Rmarkdown document to achieve various text transformations.
-- Use RMarkdown code chunks to display and/or run snippets of code.
-****
-
-== Questions
-
-
-=== Question 1
-
-Make the following text (including the asterisks) bold: `This needs to be **very** bold`. Make the following text (including the underscores) italicized: `This needs to be _very_ italicized.`
-
-[IMPORTANT]
-====
-Surround your answer in 4 backticks. This will allow you to display the markdown _without_ having the markdown "take effect". For example:
-====
-
-`````markdown
-````
-Some *marked* **up** text.
-````
-`````
-
-[TIP]
-====
-Be sure to check out the https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[Rmarkdown Cheatsheet] and our section on https://thedatamine.github.io/the-examples-book/r.html#r-rmarkdown[Rmarkdown in the book].
-====
-
-[NOTE]
-====
-Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown!
-====
-
-.Items to submit
-====
-- 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text.
-====
-
-=== Question 2
-
-Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another *ordered* list that ranks your academic interests in order of most interested to least interested.
-
-[TIP]
-====
-You can learn what ordered and unordered lists are https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[here].
-====
-
-[NOTE]
-====
-Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here.
-====
-
-.Items to submit
-====
-- Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists.
-====
-
-=== Question 3
-
-Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following:
-
-- A header for this section (your choice of size) that says "About".
-- The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link.
-
-.Items to submit
-====
-- Create the described profile, don't surround your code in backticks.
-====
-
-=== Question 4
-
-Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum:
-
-- Make the title pronounced.
-- Make all links appear as a word or words, rather than the long-form URL.
-- Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code.
-- Make the calls to the `library` function be evaluated but not displayed.
-- Make sure all warnings and errors that may eventually occur, do not appear in the final document.
-
-Feel free to make any other changes that make the report more visually pleasing.
-
-````markdown
-`r ''````{r my-load-packages}
-library(ggplot2)
-```
-
-`r ''````{r declare-variable-290, eval=FALSE}
-my_variable <- c(1,2,3)
-```
-
-All About the Iris Dataset
-
-This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code:
-
-data()
-
-The iris dataset has 5 columns. You can get the names of the columns by running the following code:
-
-names(iris)
-
-Alternatively, you could just run the following code:
-
-iris
-
-The second option provides more detail about the dataset.
-
-According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html[this R manual], there is another dataset built-in to R called `iris3`. This dataset is 3 dimensional instead of 2 dimensional.
-
-An iris is a really pretty flower. You can see a picture of one here:
-
-https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg
-
-In summary. I really like irises, and there is a dataset in R called `iris`.
-````
-
-.Items to submit
-====
-- Make improvements to this section, and place it all under the Question 4 header in your template.
-====
-
-=== Question 5
-
-Create a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`, and display the plot using a code chunk. Make sure the code used to generate the plot is hidden. Include a descriptive caption for the image. Make sure to use an RMarkdown chunk option to create the caption.
-
-.Items to submit
-====
-- Code chunk that creates and displays a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`.
-====
-
-=== Question 6
-
-Insert the following code chunk under the Question 6 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem?
-
-````markdown
-```{r my-load-packages}`r ''`
-plot(my_variable)
-```
-````
-
-[TIP]
-====
-Take a close look at the name we give our code chunk.
-====
-
-[TIP]
-====
-Take a look at the code chunk where `my_variable` is declared.
-====
-
-.Items to submit
-====
-- The modified version of the inserted code that fixes both problems.
-- A sentence explaining what the first problem was.
-- A sentence explaining what the second problem was.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project03.adoc
deleted file mode 100644
index f331ec5d4..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project03.adoc
+++ /dev/null
@@ -1,212 +0,0 @@
-= STAT 29000: Project 3 -- Fall 2020
-
-**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful `bash` tools, help you navigate a filesystem, and even run `bash` tools from within an RMarkdown file in RStudio.
-
-**Context:** At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.
-
-**Scope:** bash, RStudio
-
-.Learning objectives
-****
-- Distinguish differences in /home, /scratch, and /class.
-- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
-- Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.
-- Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
-- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
-- Use `man` to read and learn about UNIX utilities.
-- Run `bash` commands from within an RMarkdown file in RStudio.
-****
-
-There are a variety of ways to connect to Scholar. In this class, we will _primarily_ connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.
-
-There is a video that reminds you about some of the basic tools you can use in UNIX/Linux.
-
-
-This is the easiest book for learning this stuff; it is short and gets right to the point:
-
-https://learning.oreilly.com/library/view/learning-the-unix/0596002610
-
-You just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through).
-
-It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.
-
-== Questions
-
-=== Question 1
-
-Navigate to https://rstudio.scholar.rcac.purdue.edu/ and login. Take some time to click around and explore this tool. We will be writing and running Python, R, SQL, and `bash` all from within this interface. Navigate to `Tools > Global Options ...`. Explore this interface and make at least 2 modifications. List what you changed.
-
-Here are some changes Kevin likes:
-
-- Uncheck "Restore .Rdata into workspace at startup".
-- Change tab width to 4.
-- Check "Soft-wrap R source files".
-- Check "Highlight selected line".
-- Check "Strip trailing horizontal whitespace when saving".
-- Uncheck "Show margin".
-
-(Dr Ward does not like to customize his own environment, but he does use the emacs key bindings: Tools > Global Options > Code > Keybindings. This is only recommended if you already know emacs.)
-
-.Items to submit
-====
-- List of modifications you made to your Global Options.
-====
-
-=== Question 2
-
-There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a `bash` shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
-
-What is the default directory of your bash shell?
-
-[TIP]
-====
-Start by reading the section on `man`. `man` stands for manual, and you can find the "official" documentation for a command by typing `man` followed by the name of the command. For example:
-====
-
-```{bash, eval=F}
-# read the manual for the `man` command
-# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
-man man
-```
-
-.Items to submit
-====
-- The full filepath of the default directory (home directory). Ex: Kevin's is: `/home/kamstut`
-- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched.
-====
-
-=== Question 3
-
-Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order:
-
-- Write a single command to navigate to the folder containing our full datasets: `/class/datamine/data`.
-- Write a command to confirm you are in the correct folder.
-- Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories?
-- Write another command to return back to your home directory.
-- Write a command to confirm you are in the correct folder.
-
-Note: `/` is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/kamstut` is the full filepath of Kevin's home directory. There is a folder `home` inside the root directory. Inside `home` is another folder named `kamstut` which is Kevin's home directory.
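-
-For example, a quick way to see this nesting on Scholar (the exact output will differ on other machines):
-
-[source,bash]
-----
-# list the folders that live directly inside the root directory
-ls /
-# the course data folder is nested a few levels below the root
-ls /class/datamine
-----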
-
-.Items to submit
-====
-- Command used to navigate to the data directory.
-- Command used to confirm you are in the data directory.
-- Command used to list files and folders.
-- List of files and folders in the data directory.
-- Command used to navigate back to the home directory.
-- Command used to confirm you are in the home directory.
-====
-
-=== Question 4
-
-Let's learn about two more important concepts. `.` refers to the current working directory, or the directory displayed when you run `pwd`. Unlike `pwd` you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called `my_file.txt` that lives in `/home/kamstut` (so, a full path of `/home/kamstut/my_file.txt`), and you are currently in `/home/kamstut`, you could run: `cat ./my_file.txt`.
-
-`..` represents the parent folder, or the folder in which your current folder is contained. So let's say you were in `/home/kamstut/projects/` and you wanted to get the contents of the file `/home/kamstut/my_file.txt`. You could do: `cat ../my_file.txt`.
-
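-As a small sketch tying these together (using the hypothetical file and folder names from the paragraphs above):
-
-[source,bash]
-----
-# suppose the current directory is /home/kamstut
-cat ./my_file.txt              # relative path to the file
-cat /home/kamstut/my_file.txt  # absolute path to the same file
-
-# suppose the current directory is /home/kamstut/projects
-cat ../my_file.txt             # .. points at the parent folder, /home/kamstut
-cd ..                          # move up one level
-pwd                            # now prints /home/kamstut
-----
-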
-When you navigate a directory tree using `.` and `..` you create paths that are called _relative_ paths because they are _relative_ to your current directory. Alternatively, a _full_ path (or _absolute_ path) is the path starting from the root directory. So `/home/kamstut/my_file.txt` is the _absolute_ path for `my_file.txt` and `../my_file.txt` is a _relative_ path. Perform the following actions, in order:
-
-- Write a single command to navigate to the data directory.
-- Write a single command to navigate back to your home directory using a _relative_ path. Do not use `~` or the `cd` command without a path argument.
-
-.Items to submit
-====
-- Command used to navigate to the data directory.
-- Command used to navigate back to your home directory that uses a _relative_ path.
-====
-
-=== Question 5
-
-In Scholar, when you want to deal with _really_ large amounts of data, you want to access scratch (you can read more https://www.rcac.purdue.edu/policies/scholar/[here]). Your scratch directory on Scholar is located here: `/scratch/scholar/$USER`. `$USER` is an environment variable containing your username. Test it out: `echo /scratch/scholar/$USER`. Perform the following actions:
-
-- Navigate to your scratch directory.
-- Confirm you are in the correct location.
-- Execute `myquota`.
-- Find the location of the `myquota` bash script.
-- Output the first 5 and last 5 lines of the bash script.
-- Count the number of lines in the bash script.
-- How many kilobytes is the script?
-
-[TIP]
-====
-You could use each of the commands in the relevant topics once.
-====
-
-[TIP]
-====
-When you type `myquota` on Scholar there are sometimes two warnings about `xauth`, but sometimes there are no warnings. If you get a warning that says `Warning: untrusted X11 forwarding setup failed: xauth key data not generated`, it is safe to ignore it.
-====
-
-[TIP]
-====
-Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the _options_ of a command in the `DESCRIPTION` section of the `man` pages. For example: `man wc`. You can see `-m`, `-l`, and `-w` are all options for `wc`. To test this out:
-====
-
-```{bash, eval=F}
-# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
-wc /class/datamine/data/flights/1987.csv
-# to count the lines, use the -l option
-wc -l /class/datamine/data/flights/1987.csv
-# to count the words, use the -w option
-wc -w /class/datamine/data/flights/1987.csv
-# you can combine options as well
-wc -w -l /class/datamine/data/flights/1987.csv
-# some people like to use a single tack `-`
-wc -wl /class/datamine/data/flights/1987.csv
-# order doesn't matter
-wc -lw /class/datamine/data/flights/1987.csv
-```
-
-[TIP]
-====
-The `-h` option for the `du` command is useful.
-====
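-
-For example, a minimal sketch using the 1987 flights file as a stand-in (not the `myquota` script itself):
-
-[source,bash]
-----
-# size of a file in human-readable units (K, M, G)
-du -h /class/datamine/data/flights/1987.csv
-# compare with the default output, which is in 1K blocks
-du /class/datamine/data/flights/1987.csv
-----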
-
-.Items to submit
-====
-- Command used to navigate to your scratch directory.
-- Command used to confirm your location.
-- Output of `myquota`.
-- Command used to find the location of the `myquota` script.
-- Absolute path of the `myquota` script.
-- Command used to output the first 5 lines of the `myquota` script.
-- Command used to output the last 5 lines of the `myquota` script.
-- Command used to find the number of lines in the `myquota` script.
-- Number of lines in the script.
-- Command used to find out how many kilobytes the script is.
-- Number of kilobytes that the script takes up.
-====
-
-=== Question 6
-
-Perform the following operations:
-
-- Navigate to your scratch directory.
-- Copy and paste the file: `/class/datamine/data/flights/1987.csv` to your current directory (scratch).
-- Create a new directory called `my_test_dir` in your scratch folder.
-- Move the file you copied to your scratch directory, into your new folder.
-- Use `touch` to create an empty file named `im_empty.txt` in your scratch folder.
-- Remove the directory `my_test_dir` _and_ the contents of the directory.
-- Remove the `im_empty.txt` file.
-
-[TIP]
-====
-`rmdir` may not be able to do what you think, instead, check out the options for `rm` using `man rm`.
-====
-
-.Items to submit
-====
-- Command used to navigate to your scratch directory.
-- Command used to copy the file, `/class/datamine/data/flights/1987.csv` to your current directory (scratch).
-- Command used to create a new directory called `my_test_dir` in your scratch folder.
-- Command used to move the file you copied earlier `1987.csv` into your new `my_test_dir` folder.
-- Command used to create an empty file named `im_empty.txt` in your scratch folder.
-- Command used to remove the directory _and_ the contents of the directory `my_test_dir`.
-- Command used to remove the `im_empty.txt` file.
-====
-
-=== Question 7
-
-Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project04.adoc
deleted file mode 100644
index 5b163c8f7..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project04.adoc
+++ /dev/null
@@ -1,174 +0,0 @@
-= STAT 29000: Project 4 -- Fall 2020
-
-**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. `grep` is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.
-
-**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python.
-
-**Scope:** grep, regular expression basics, utilizing regular expression tools in R and Python
-
-.Learning objectives
-****
-- Use `grep` to search for patterns within a dataset.
-- Use `cut` to section off and slice up data from the command line.
-- Use `wc` to count the number of lines of input.
-****
-
-You can find useful examples that walk you through relevant material in The Examples Book:
-
-https://thedatamine.github.io/the-examples-book
-
-It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
-
-[IMPORTANT]
-====
-I would highly recommend using single quotes `'` to surround your regular expressions. Double quotes can have unexpected behavior due to some shells' expansion rules. In addition, pay close attention to #faq-escape-characters[escaping] certain https://unix.stackexchange.com/questions/20804/in-a-regular-expression-which-characters-need-escaping[characters] in your regular expressions.
-====
-
-== Dataset
-
-The following questions will use the dataset `the_office_dialogue.csv` found in Scholar under the data directory `/class/datamine/data/`. A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/movies-and-tv/the_office_dialogue.csv[the_office_dialogue.csv]
-
-Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
-
-`grep` stands for _**g**lobally_ search for a _**r**egular_ _**e**xpression_ and _**p**rint_ matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. You can read about and see examples of `grep` https://thedatamine.github.io/the-examples-book/unix.html#grep[here].
-
-== Questions
-
-=== Question 1
-
-Login to Scholar and use `grep` to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica." Where is it located exactly?
-
-.Items to submit
-====
-- The `grep` command used to find the dataset.
-- The name and location in Scholar of the dataset.
-====
-
-=== Question 2
-
-`grep` prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first _n_ lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.
-
-Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.
-
-[TIP]
-====
-The result if you were to search for "bears. beets. battlestar galactica." should be:
-
-----
-"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."
-----
-====
-
-[TIP]
-====
-One method to solve this problem would be to https://thedatamine.github.io/the-examples-book/unix.html#piping-and-redirection[pipe]
-the output from `grep` to https://thedatamine.github.io/the-examples-book/unix.html#cut[cut].
-====
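-
-As a generic sketch of that pipeline (a hypothetical comma-separated file `example.csv` in which the speaker happens to be in column 2 and the dialogue in column 5; these are not the actual column positions):
-
-[source,bash]
-----
-# find the matching row(s), ignoring case, then keep only the 2nd and 5th fields
-grep -i 'some phrase' example.csv | cut -d',' -f2,5
-----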
-
-.Items to submit
-====
-- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
-====
-
-=== Question 3
-
-This particular dataset happens to be very small. You could imagine a scenario where the file is many gigabytes and not easy to load completely into R or Python. We are interested in learning what makes Jim and Pam tick as a couple. Use a line of UNIX commands to create a new dataset called `jim_and_pam.csv` (remember, a good place to store data temporarily is `/scratch/scholar/$USER`). Include only lines that are spoken by either Jim or Pam, or reference Jim or Pam in any way. How many rows of data are in the new file? How many megabytes is the new file (to the nearest 1/10th of a megabyte)?
-
-[TIP]
-====
-https://thedatamine.github.io/the-examples-book/unix.html#piping-and-redirection[Redirection].
-====
-
-[TIP]
-====
-It is OK if you get an erroneous line where the word "jim" or "pam" appears as a part of another word.
-====
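-
-One possible shape for such a pipeline, written only as a sketch (assuming you are in the folder that contains the dataset; the exact pattern and options are up to you):
-
-[source,bash]
-----
-# keep every line mentioning either name (case-insensitive) and redirect it to a new file
-grep -i -E 'jim|pam' the_office_dialogue.csv > /scratch/scholar/$USER/jim_and_pam.csv
-# count the rows and check the size of the new file (in whole megabytes)
-wc -l /scratch/scholar/$USER/jim_and_pam.csv
-du -m /scratch/scholar/$USER/jim_and_pam.csv
-----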
-
-.Items to submit
-====
-- The line of UNIX commands used to create the new file.
-- The number of rows of data in the new file, and the accompanying UNIX command used to find this out.
-- The number of megabytes (to the nearest 1/10th of a megabyte) that the new file has, and the accompanying UNIX command used to find this out.
-====
-
-=== Question 4
-
-Find all lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark. Use only 1 "!" within your regular expression. How many lines are there? Ignore case (whether or not parts of the names are capitalized).
-
-.Items to submit
-====
-- The UNIX command(s) used to solve this problem.
-- The number of lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark.
-====
-
-=== Question 5
-
-Find all lines that contain the text "that's what" followed by any amount of any text and then "said". How many lines are there?
-
-.Items to submit
-====
-- The UNIX command used to solve this problem.
-- The number of lines that contain the text "that's what" followed by any amount of text and then "said".
-====
-
-Regular expressions are a really useful, largely language-agnostic tool. What this means is that, regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using `read.csv`. Name the resulting data.frame `dat`.
-
-=== Question 6
-
-The `text_w_direction` column in `dat` contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expressions to detect the directions.
-
-(a) Create a new column called `has_direction` that is set to `TRUE` if the `text_w_direction` column has direction, and `FALSE` otherwise. Use the `grepl` function in R to accomplish this.
-
-[TIP]
-====
-Make sure all opening brackets "[" have a corresponding closing bracket "]".
-====
-
-[TIP]
-====
-Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.
-====
-
-(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have 2 or more directions? Modify your code again and find how many have 5 or more.
-
-We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.
-
-----
-This is a line with [emphasize this] only 1 direction!
-This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
-----
-
-Your solution to part (a) should match both lines. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.
-
-In our actual dataset, for example, `dat$text_w_direction[2789]` is a line with 2 directions.
-
-.Items to submit
-====
-- The R code and regular expression used to solve the first part of this problem.
-- The R code and regular expression used to solve the second part of this problem.
-- How many lines have >= 2 directions?
-- How many lines have >= 5 directions?
-====
-
-=== OPTIONAL QUESTION
-
-Use the `str_extract_all` function from the `stringr` package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called `direction`.
-
-----
-This is a line with [emphasize this] only 1 direction!
-This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
-----
-
-In this question, your solution may have extracted:
-
-----
-[emphasize this]
-[emphasize this] 2 sets of direction, do you see the difference [shrug]
-----
-
-(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.)
-
-.Items to submit
-====
-- The R code used to solve this problem.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project05.adoc
deleted file mode 100644
index adce59862..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project05.adoc
+++ /dev/null
@@ -1,159 +0,0 @@
-= STAT 29000: Project 5 -- Fall 2020
-
-**Motivation:** Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.
-
-**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
-
-**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping
-
-.Learning objectives
-****
-- Use `cut` to section off and slice up data from the command line.
-- Use piping to string UNIX commands together.
-- Use `sort` and its options to sort data in different ways.
-- Use `head` to isolate _n_ lines of output.
-- Use `wc` to summarize the number of lines in a file or in output.
-- Use `uniq` to filter out non-unique lines.
-- Use `grep` to search files effectively.
-****
-
-You can find useful examples that walk you through relevant material in The Examples Book:
-
-https://thedatamine.github.io/the-examples-book
-
-It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
-
-Don't forget the very useful documentation shortcut `?` for R code. To use, simply type `?` in the console, followed by the name of the function you are interested in. In the Terminal, you can use the `man` command to check the documentation of `bash` code.
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/amazon/amazon_fine_food_reviews.csv`
-
-A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/amazon/amazon_fine_food_reviews.csv[amazon_fine_food_reviews.csv]
-
-Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
-
-There are also three videos that might be useful as you work on Project 5.
-
-
-== Questions
-
-=== Question 1
-
-What is the `Id` of the most helpful review, according to the highest `HelpfulnessNumerator`?
-
-[IMPORTANT]
-====
-You can always pipe output to `head` in case you want the first few values of a lot of output. Note that if you used `sort` before `head`, you may see the following error messages:
-
-----
-sort: write failed: standard output: Broken pipe
-sort: write error
-----
-
-This happens because `head` stops reading and closes the pipe once it has the lines it needs, so `sort` cannot finish writing its output. This is okay. See https://stackoverflow.com/questions/46202653/bash-error-in-sort-sort-write-failed-standard-output-broken-pipe[this discussion] for more details.
-====
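-
-For instance, a generic sort-then-head sketch (treating field 5 of a hypothetical `reviews.csv` as numeric; this is not necessarily where `HelpfulnessNumerator` lives):
-
-[source,bash]
-----
-# sort descending on the numeric value in field 5, then keep just the top line
-sort -t',' -k5,5 -nr reviews.csv | head -n 1
-----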
-
-.Items to submit
-====
-- Line of UNIX commands used to solve the problem.
-- The `Id` of the most helpful review.
-====
-
-=== Question 2
-
-Some entries under the `Summary` column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
-
-To further clarify what we mean by _unique_, if we had the following vector in R, `c("a", "b", "a", "c")`, its unique values are `c("a", "b", "c")`.
-
-.Items to submit
-====
-- Two lines of UNIX commands used to solve the problem.
-- The proportion of unique `Summary` values.
-====
-
-=== Question 3
-
-Use a chain of UNIX commands, piped in a sequence, to create a frequency table of `Score`.
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-- The frequency table.
-====
-
-=== Question 4
-
-Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?
-
-[TIP]
-====
-You may need to pipe the output to `sort` multiple times.
-====
-
-[TIP]
-====
-To create the frequency table, read through the `man` pages for `uniq`. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:
-
-[source,bash]
-----
-man uniq
-----
-====
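-
-The general frequency-table idiom looks like this (field 3 is just a placeholder, not necessarily the column you need):
-
-[source,bash]
-----
-# tally how often each value of field 3 occurs, most frequent first
-cut -d',' -f3 reviews.csv | sort | uniq -c | sort -nr | head
-----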
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-- The frequency table.
-====
-
-=== Question 5
-
-Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative.
-
-[TIP]
-====
-You may find the solution to problem (3) useful.
-====
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-====
-
-=== Question 6
-
-Find the most helpful review with a `Score` of 5. Then (separately) find the most helpful review with a `Score` of 1. As before, we are considering the most helpful review to be the review with the highest `HelpfulnessNumerator`.
-
-[TIP]
-====
-You can use multiple lines to solve this problem.
-====
-
-.Items to submit
-====
-- The lines of UNIX commands used to solve the problem.
-- `ProductId`'s of both requested reviews.
-====
-
-=== OPTIONAL QUESTION
-
-For **only** the two `ProductId`s from the previous question, create a new dataset called `scores.csv` that contains the `ProductId`s and `Score`s from all reviews for these two items.
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project06.adoc
deleted file mode 100644
index 2239a1f19..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project06.adoc
+++ /dev/null
@@ -1,211 +0,0 @@
-= STAT 29000: Project 6 -- Fall 2020
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, UNIX utilities, bash scripts
-
-.Learning objectives
-****
-- Use `awk` to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-== Dataset
-
-The following questions will use the dataset found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/YYYY.csv[here] or in Scholar:
-
-`/class/datamine/data/flights/subset/YYYY.csv`
-
-An example from 1987 data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here] or in Scholar:
-
-`/class/datamine/data/flights/subset/1987.csv`
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-
-=== Question 1
-
-In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the `Origin`, from `1987.csv`. Write another line, this time using `awk` to do the same thing. Which one do you prefer, and why?
-
-Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:
-
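-As a rough sketch of the two approaches (on a hypothetical comma-separated file `donations.csv`, not the flights data):
-
-[source,bash]
-----
-# print the 2nd column with cut
-cut -d',' -f2 donations.csv
-# print the 2nd column with awk
-awk -F',' '{ print $2 }' donations.csv
-----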
-
-.Items to submit
-====
-- One line of UNIX commands to solve the problem *without* using `awk`.
-- One line of UNIX commands to solve the problem using `awk`.
-- 1-2 sentences describing which method you prefer and why.
-====
-
-=== Question 2
-
-Write a bash script that accepts a year (1987, 1988, etc.) and a column *n* and returns the *nth* column of the associated year of data.
-
-Here are two examples to illustrate how to write a bash script:
-
-
-[TIP]
-====
-In this example, you only need to turn in the content of your bash script (starting with `#!/bin/bash`) in a code chunk, without evaluating it. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:
-
-[source,bash]
-----
-#!/bin/bash
-echo "First argument: $1"
-echo "Second argument: $2"
-----
-====
-
-If you simply drop that text into a file called `my_script.sh`, located here: `/home/$USER/my_script.sh`, and if you run the following:
-
-[source,bash]
-----
-# Setup bash to run; this only needs to be run one time per session.
-# It makes bash behave a little more naturally in RStudio.
-exec bash
-# Navigate to the location of my_script.sh
-cd /home/$USER
-# Make sure that the script is runable.
-# This only needs to be done one time for each new script that you write.
-chmod 755 my_script.sh
-# Execute my_script.sh
-./my_script.sh okay cool
-----
-
-then it will print:
-
-----
-First argument: okay
-Second argument: cool
-----
-
-In this example, if we were to turn in the content of your bash script (starting with `#!/bin/bash`) in a code chunk, our solution would look like this:
-
-[source,bash]
-----
-#!/bin/bash
-echo "First argument: $1"
-echo "Second argument: $2"
-----
-
-And although we aren't running the code chunk above, we know that it works because we tested it in the terminal.
-
-[TIP]
-====
-Using `awk` you could have a script with just two lines: 1 with the "hash-bang" (`#!/bin/bash`), and 1 with a single `awk` command.
-====
-
-.Items to submit
-====
-- The content of your bash script (starting with `#!/bin/bash`) in a code chunk.
-====
-
-=== Question 3
-
-How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using `awk`, then solve this problem using *only* `awk`.
-
-Here is a similar example, using the election data set:
-
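-In the same spirit, a sketch of counting matching rows with and without `awk` (placeholder field number, value, and file):
-
-[source,bash]
-----
-# without awk: isolate field 18, then count lines that are exactly "ORD"
-cut -d',' -f18 1987.csv | grep -c '^ORD$'
-# with awk only: bump a counter whenever field 18 equals "ORD", print it at the end
-awk -F',' '$18 == "ORD" { n++ } END { print n + 0 }' 1987.csv
-----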
-
-.Items to submit
-====
-- One line of UNIX commands to solve the problem *without* using `awk`.
-- One line of UNIX commands to solve the problem using `awk`.
-- The number of flights that arrived at Indianapolis (IND) in 2008.
-====
-
-=== Question 4
-
-Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (`Origin`, `Dest`)?
-
-Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:
-
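-A sketch of looking at two columns together (again with placeholder field numbers and file):
-
-[source,bash]
-----
-# build "field17,field18" pairs, then count how often each distinct pair appears
-awk -F',' 'NR > 1 { print $17 "," $18 }' 1987.csv | sort | uniq -c | sort -nr | head
-----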
-
-.Items to submit
-====
-- 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
-- The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
-- The number of unique values per category (`Origin`, `Dest`).
-====
-
-=== Question 5
-
-In (4) we found that there are not the same number of unique `Origin` as `Dest`. Find the https://en.wikipedia.org/wiki/International_Air_Transport_Association_code#Airport_codes[IATA airport code] for all `Origin` that don't appear in a `Dest` and all `Dest` that don't appear in an `Origin` in the 2008 data.
-
-[TIP]
-====
-The examples on https://www.tutorialspoint.com/unix_commands/comm.htm[this page] should help. Note that these examples are based on https://tldp.org/LDP/abs/html/process-sub.html[Process Substitution], which basically allows you to specify commands whose output would be used as the input of `comm`. There should be no space between the `<` and the opening parenthesis, otherwise your bash will not work as intended.
-====
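-
-A minimal `comm` sketch with process substitution (two hypothetical one-column files):
-
-[source,bash]
-----
-# values present in file_a.txt but not in file_b.txt (-23 hides the other two columns)
-comm -23 <(sort file_a.txt) <(sort file_b.txt)
-# values present in file_b.txt but not in file_a.txt
-comm -13 <(sort file_a.txt) <(sort file_b.txt)
-----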
-
-.Items to submit
-====
-- The line(s) of UNIX command(s) used to answer the question.
-- The list of all `Origin` that don't appear in `Dest`.
-- The list of all `Dest` that don't appear in `Origin`.
-====
-
-=== Question 6
-
-What was the percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND"? What percentage of flights had "PHX" as `Origin` (among all flights with `Dest` of "IND")?
-
-Here is an example using the percentages of donations contributed from CEOs from various States:
-
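-A generic per-category percentage sketch in `awk` (placeholder field number and file):
-
-[source,bash]
-----
-# what percentage of the data rows does each value of field 17 account for?
-awk -F',' 'NR > 1 { count[$17]++; total++ } END { for (k in count) print k, 100 * count[k] / total "%" }' 1987.csv
-----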
-
-[TIP]
-====
-You can do the mean calculation in awk by dividing the result from (3) by the number of unique `Origin` that have a `Dest` of "IND".
-====
-
-.Items to submit
-====
-- The percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND".
-- 1-2 sentences explaining how "PHX" compares (as a unique `ORIGIN`) to the other `Origin` (all with the `Dest` of "IND")?
-====
-
-=== OPTIONAL QUESTION
-
-Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:
-
-----
-1987, 12345
-1988, 44
-----
-
-Run the script with inputs: `1991` and `ORD`. Include the output in your submission.
-
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-- The output of the script given `1991` and `ORD` as inputs.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project07.adoc
deleted file mode 100644
index 889bf4a7e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project07.adoc
+++ /dev/null
@@ -1,138 +0,0 @@
-= STAT 29000: Project 7 -- Fall 2020
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the second part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, UNIX utilities, bash scripts
-
-.Learning objectives
-****
-- Use `awk` to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-`/class/datamine/data/flights/subset/YYYY.csv`
-
-An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here].
-
-Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-
-=== Question 1
-
-Write a line of code that prints a list of the unique values in the `DayOfWeek` column. Write a line of code that prints a list of the unique values in the `DayOfMonth` column. Write a line of code that prints a list of the unique values in the `Month` column. Use the `1987.csv` dataset. Are the results what you expected?
-
-.Items to submit
-====
-- 3 lines of code used to get a list of unique values for the chosen columns.
-- 1-2 sentences explaining whether or not the results are what you expected.
-====
-
-=== Question 2
-
-Our files should have 29 columns. For a given file, write a line of code that prints any lines that do *not* have 29 columns. Test it on `1987.csv`; were there any rows without 29 columns?
-
-[TIP]
-====
-See the notes on `awk` built-in variables; `NF` looks like it may be useful!
-====
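-
-For example, a sketch of an `NF` check (on a hypothetical 5-column file):
-
-[source,bash]
-----
-# print any record that does not contain exactly 5 comma-separated fields
-awk -F',' 'NF != 5' somefile.csv
-----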
-
-.Items to submit
-====
-- Line of code used to solve the problem.
-- 1-2 sentences explaining whether or not there were any rows without 29 columns.
-====
-
-=== Question 3
-
-Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do *not* have 29 columns.
-
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-- The results of running your bash scripts from year 1987 to 2008.
-====
-
-=== Question 4
-
-`awk` is a really good tool to quickly get some data and manipulate it a little bit. The column `Distance` contains the distances of the flights in miles. Use `awk` to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.
-
-The following is an output example:
-
-----
-Miles: 12345
-Kilometers: 19867.35168
-----
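-
-The general running-total idiom in `awk` looks something like this (placeholder field number and file; the conversion factor is the one given above):
-
-[source,bash]
-----
-# add up field 3 over every data row, then report it raw and converted
-awk -F',' 'NR > 1 { total += $3 } END { print "Total:", total; print "Converted:", total * 1.609344 }' somefile.csv
-----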
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The results of running the code.
-====
-
-=== Question 5
-
-Use `awk` to calculate the sum of the number of `DepDelay` minutes, grouped according to `DayOfWeek`. Use `2007.csv`.
-
-The following is an output example:
-
-----
-DayOfWeek: 0
-1: 1234567
-2: 1234567
-3: 1234567
-4: 1234567
-5: 1234567
-6: 1234567
-7: 1234567
-----
-
-[NOTE]
-====
-1 is Monday.
-====
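-
-The grouped version of the same idiom uses an associative array keyed by the grouping column (again with placeholder fields and file):
-
-[source,bash]
-----
-# accumulate field 5 separately for each distinct value of field 2
-awk -F',' 'NR > 1 { sum[$2] += $5 } END { for (k in sum) print k ": " sum[k] }' somefile.csv
-----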
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The output from running the code.
-====
-
-=== Question 6
-
-It wouldn't be fair to compare the total `DepDelay` minutes by `DayOfWeek`, as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of `DepDelay` minutes per flight, for each `DayOfWeek`. Use `2007.csv`.
-
-The following is an output example:
-
-----
-DayOfWeek: 0
-1: 1.234567
-2: 1.234567
-3: 1.234567
-4: 1.234567
-5: 1.234567
-6: 1.234567
-7: 1.234567
-----
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The output from running the code.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project08.adoc
deleted file mode 100644
index f3821db8a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project08.adoc
+++ /dev/null
@@ -1,144 +0,0 @@
-= STAT 29000: Project 8 -- Fall 2020
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, UNIX utilities, bash scripts
-
-.Learning objectives
-****
-- Use `awk` to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-`/class/datamine/data/flights/subset/YYYY.csv`
-
-An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here].
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to losing points.
-====
-
-=== Question 1
-
-Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!
-
-Write a line of `awk` code that prints the _total number of flights_ that occur on weekend days, followed by the _total number of flights_ that occur on the weekdays. Complete this calculation for 2008 using the `2008.csv` file.
-
-[NOTE]
-====
-Under the column `DayOfWeek`, Monday through Sunday are represented by 1-7, respectively.
-====
-
-.Items to submit
-====
-- Line of `awk` code that solves the problem.
-- The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
-====
-
-=== Question 2
-
-Note that in (1), we are comparing 3 days to 4! Write a line of `awk` code that prints the average number of flights on a weekend day, followed by the average number of flights on a weekday. Continue to use data for 2008.
-
-[TIP]
-====
-You don't need a large if statement to do this; you can use the `~` comparison operator, as in the sketch below.
-====
-
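-For example, one hedged sketch of how `~` can split records into the two groups (the `DayOfWeek` column position is an assumption):
-
-[source,bash]
-----
-# sketch: $4 ~ /[567]/ is true for Friday (5), Saturday (6), and Sunday (7)
-awk -F, 'NR>1 {if ($4 ~ /[567]/) wknd++; else wkdy++} END {print wknd, wkdy}' 2008.csv
-----
-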
-.Items to submit
-====
-- Line of `awk` code that solves the problem.
-- The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
-====
-
-=== Question 3
-
-We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit.
-
-Take a look at `airports.csv`. In particular run the following:
-
-[source,bash]
-----
-head airports.csv
-----
-
-Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within `awk`. This is messy and we would prefer to create a new file called `new_airports.csv` without any quotes. Write a line of code to do this.
-
-[NOTE]
-====
-You may be wondering *why* we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things.
-====
-
-[TIP]
-====
-You could use `gsub` within `awk` to replace '"' with ''. You can find how to use `gsub` https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html[here].
-====
-
-[TIP]
-====
-If you leave out the target (third) argument to `gsub`, the substitution is applied to the entire record (`$0`), that is, to every field. A sketch of this pattern follows.
-====
-
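-For example, a minimal sketch of the `gsub` approach, applied to the whole record:
-
-[source,bash]
-----
-# sketch: strip every double quote from every line and write the result to a new file
-awk '{gsub(/"/, ""); print}' airports.csv > new_airports.csv
-----
-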
-[TIP]
-====
-[source,bash]
-----
-cat new_airports.csv | wc -l
-# should be 159 without header
-----
-====
-
-.Items to submit
-====
-- Line of `awk` code used to create the new dataset.
-====
-
-=== Question 4
-
-Write a line of commands that creates a new dataset called `az_fl_airports.txt`. `az_fl_airports.txt` should _only_ contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3), `new_airports.csv` as a starting point.
-
-How many airports are there? Did you expect this? Use a line of bash code to count this.
-
-Create a new dataset called `az_fl_flights.txt` that contains all of the data for flights into or out of Florida and Arizona using the `2008.csv` file. Use the newly created dataset, `az_fl_airports.txt`, to accomplish this.
-
-[TIP]
-====
-https://unix.stackexchange.com/questions/293684/basic-grep-awk-help-extracting-all-lines-containing-a-list-of-terms-from-one-f
-====
-
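-A hedged sketch of the pattern described in the link above (the exact flags may need adjusting depending on how the airport codes appear in `2008.csv`):
-
-[source,bash]
-----
-# sketch: keep only the lines of 2008.csv that contain one of the codes listed in az_fl_airports.txt
-grep -F -f az_fl_airports.txt 2008.csv > az_fl_flights.txt
-----
-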
-[TIP]
-====
-[source,bash]
-----
-cat az_fl_flights.txt | wc -l # should be 484705
-----
-====
-
-.Items to submit
-====
-- All UNIX commands used to answer the questions.
-- The number of airports.
-- 1-2 sentences explaining whether you expected this number of airports.
-====
-
-=== Question 5
-
-Write a bash script that accepts the year as an argument and performs the same operations as in question 4, returning the number of flights into and out of both AZ and FL for any given year.
-
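-A minimal skeleton is sketched below; the script name and output filename are just examples, and the body is a placeholder to be filled in with your commands from (4).
-
-[source,bash]
-----
-#!/bin/bash
-# sketch: the year is passed as the first argument, e.g. ./az_fl_flights.sh 2008
-year=$1
-# repeat the steps from question (4) here, swapping 2008.csv for ${year}.csv
-grep -F -f az_fl_airports.txt ${year}.csv > az_fl_flights_${year}.txt
-wc -l az_fl_flights_${year}.txt
-----
-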
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-- The line of UNIX code you used to execute the script and create the new dataset.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project09.adoc
deleted file mode 100644
index 62aa8ae4e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project09.adoc
+++ /dev/null
@@ -1,235 +0,0 @@
-= STAT 29000: Project 9 -- Fall 2020
-
-**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://www.cloudflare.com/[cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is _well_ worth your time!
-
-**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.
-
-**Scope:** SQL, sqlite
-
-.Learning objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/lahman/lahman.db`
-
-This is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to losing points.
-====
-
-[IMPORTANT]
-====
-For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd
-====
-
-````markdown
-```{r, include=F}`r ''`
-library(RSQLite)
-lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")
-```
-````
-
-Each solution should then be placed in a code chunk like this:
-
-````markdown
-```{sql, connection=lahman}`r ''`
-SELECT * FROM batting LIMIT 1;
-```
-````
-
-If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
-
-[source,bash]
-----
-sqlite3 /class/datamine/data/lahman/lahman.db
-----
-
-From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this:
-
-````markdown
-```{sql, connection=lahman, eval=F}`r ''`
-SELECT * FROM batting LIMIT 1;
-```
-````
-
-This will allow the code to be displayed without throwing an error.
-
-=== Question 1
-
-Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, navigate to the terminal, and access the Lahman database. How many tables are available?
-
-[TIP]
-====
-To connect to the database, do the following:
-====
-
-```{bash, eval=F}
-sqlite3 /class/datamine/data/lahman/lahman.db
-```
-
-[TIP]
-====
-https://database.guide/2-ways-to-list-tables-in-sqlite-database/[This] is a good resource.
-====
-
-.Items to submit
-====
-- How many tables are available in the Lahman database?
-- The sqlite3 commands used to figure out how many tables are available.
-====
-
-=== Question 2
-
-Some people like to try to https://www.washingtonpost.com/graphics/2017/sports/how-many-mlb-parks-have-you-visited/[visit all 30 MLB ballparks] in their lifetime. Use SQL commands to get a list of `parks` and the cities they're located in. For your final answer, limit the output to 10 records/rows.
-
-[NOTE]
-====
-There may be more than 30 parks in your result; this is OK. For long results, you can limit the number of printed results using the `LIMIT` clause.
-====
-
-[TIP]
-====
-Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:
-
-[source,SQL]
-----
-.headers on
-----
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 3
-
-There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (`HR`) in a season. List their `playerID`, `yearID`, home run total, and the `teamID` they played for.
-
-[TIP]
-====
-There are 8 occurrences of home runs greater than or equal to 60.
-====
-
-[TIP]
-====
-The `batting` table is where you should look for this question.
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 4
-
-Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.
-
-[TIP]
-====
-The `people` table is where you should look for this question.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 5
-
-Get the Cleveland (CLE) Pitching Roster from the 2016 season (`playerID`, `W`, `L`, `SO`). Order the pitchers by number of Strikeouts (SO) in descending order.
-
-[TIP]
-====
-The `pitching` table is where you should look for this question.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 6
-
-Find the 10 team and year pairs that have the highest number of Errors (`E`) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?
-
-[TIP]
-====
-The `teams` table is where you should look for this question.
-====
-
-[TIP]
-====
-The `BETWEEN` clause is useful here.
-====
-
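-For example, a small sketch of the `BETWEEN` clause on its own (not a full solution):
-
-[source,SQL]
-----
-SELECT yearID, teamID, W, L, E FROM teams WHERE yearID BETWEEN 1960 AND 1970 LIMIT 5;
-----
-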
-[TIP]
-====
-It is OK to use multiple queries to answer the question.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 7
-
-Find the `playerID` for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table `pitching`)? What year and team did he win the most games as a manager (use table `managers`)?
-
-[TIP]
-====
-It is OK to use multiple queries to answer the question.
-====
-
-[NOTE]
-====
-There was a tie between the two years in which Bob Lemon had the most wins as a pitcher.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project10.adoc
deleted file mode 100644
index 45cb31d19..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project10.adoc
+++ /dev/null
@@ -1,210 +0,0 @@
-= STAT 29000: Project 10 -- Fall 2020
-
-**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it _will_ start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.
-
-**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`.
-
-**Scope:** SQL, sqlite
-
-.Learning objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.
-****
-
-== Dataset
-
-The following questions will use a dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at `scholar-db.rcac.purdue.edu`.
-As in Project 9, this is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to losing points.
-====
-
-[IMPORTANT]
-====
-For this project all solutions should be done using R code chunks, and the `RMariaDB` package. Run the following code to load the library:
-
-[source,r]
-----
-library(RMariaDB)
-----
-====
-
-=== Question 1
-
-Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and, rather than navigating to the terminal like we did in the previous project, create a connection to our MariaDB lahman database using the `RMariaDB` package in R and the credentials below. Confirm the connection by running the following code chunk:
-
-[source,r]
-----
-con <- dbConnect(RMariaDB::MariaDB(),
- host="scholar-db.rcac.purdue.edu",
- db="lahmandb",
- user="lahman_user",
- password="HitAH0merun")
-head(dbGetQuery(con, "SHOW tables;"))
-----
-
-[TIP]
-====
-In the example provided, the variable `con` from the `dbConnect` function is the connection. Each query that you make using `dbGetQuery` needs to use this connection `con`. You can change the name `con` if you want to (it is user defined), but if you change the name `con`, you need to change it on all of your connections. If your connection to the database dies while you are working on the project, you can always re-run the `dbConnect` line again, to reset your connection to the database.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Output from running your (potentially modified) `head(dbGetQuery(con, "SHOW tables;"))`.
-====
-
-=== Question 2
-
-How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (`SB`) and hit at least 40 home runs (`HR`) in one year.
-
-[TIP]
-====
-Use the `batting` table.
-====
-
-[IMPORTANT]
-====
-You only need to run `library(RMariaDB)` and the `dbConnect` portion of the code a single time towards the top of your project. After that, you can simply reuse your connection `con` to run queries.
-====
-
-[IMPORTANT]
-====
-In our xref:templates.adoc[project template], for this project, make all of the SQL queries using the `dbGetQuery` function, which returns the results directly in `R`. Therefore, your `RMarkdown` blocks for this project should all be `{r}` blocks (as opposed to the `{sql}` blocks used in Project 9).
-====
-
-[TIP]
-====
-You can use `dbGetQuery` to run your queries from within R. Example:
-
-[source,r]
-----
-dbGetQuery(con, "SELECT * FROM batting LIMIT 5;")
-----
-====
-
-[NOTE]
-====
-We already demonstrated the correct SQL query to use for the 40/40 club in the video below, but now we want you to use `RMariaDB` to solve this query.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 3
-
-Find Corey Kluber's lifetime totals across his career (i.e., use `SUM` from `SQL` to summarize his achievements) in two categories: strikeouts (`SO`) and walks (`BB`). Also display his Strikeouts to Walks ratio. A Strikeouts to Walks ratio is calculated by this equation: stem:[\frac{Strikeouts}{Walks}].
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use `sum` in R rather than `SUM` in SQL.
-====
-
-[TIP]
-====
-Use the `people` table to find the `playerID` and use the `pitching` table to find the statistics.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 4
-
-How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?
-
-[TIP]
-====
-Use the `people` table to find the `playerID` and use the `batting` table to find the statistics.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 5
-
-The https://en.wikipedia.org/wiki/Batting_average_(baseball)[Batting Average] is a metric for a batter's performance. The Batting Average in a year is calculated by stem:[\frac{H}{AB}] (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to `playerID`, `teamID`, and `yearID`.
-
-[TIP]
-====
-Use the `batting` table.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 6
-
-How many unique players have hit > 50 home runs (`HR`) in a season?
-
-[TIP]
-====
-Use the `batting` table.
-====
-
-[TIP]
-====
-Instead of viewing `DISTINCT` as being paired with `SELECT`, think of it as being paired with one of the fields you are selecting, as in the sketch below.
-====
-
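-For example, a hedged sketch of that pattern, wrapped in `dbGetQuery` as in the earlier questions:
-
-[source,r]
-----
-# sketch: DISTINCT applied to the field inside COUNT, not to the SELECT as a whole
-dbGetQuery(con, "SELECT COUNT(DISTINCT playerID) FROM batting WHERE HR > 50;")
-----
-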
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 7
-
-Find the number of unique players that attended Purdue University. Start by finding the `schoolID` for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.
-
-[TIP]
-====
-Use the `schools` table to get the `schoolID`s, and the `collegeplaying` table to get the statistics.
-====
-
-[TIP]
-====
-You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.
-====
-
-[TIP]
-====
-To find IU's id, try the following query: `SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';`. You can find more about the LIKE clause and `%` https://www.tutorialspoint.com/sql/sql-like-clause.htm[here].
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project11.adoc
deleted file mode 100644
index 94b4ee5d4..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project11.adoc
+++ /dev/null
@@ -1,215 +0,0 @@
-= STAT 29000: Project 11 -- Fall 2020
-
-**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
-
-**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!
-
-**Scope:** SQL, SQL in R
-
-.Learning objectives
-****
-- Demonstrate the ability to interact with popular database management systems within R.
-- Solve data-driven problems using a combination of SQL and R.
-- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
-- Showcase the ability to filter, alias, and write subqueries.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.
-****
-
-== Dataset
-
-The following questions will use the `elections` database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:
-
-`/class/datamine/data/election/itcontYYYY.txt` (for example, data for year 1980 would be `/class/datamine/data/election/itcont1980.txt`)
-
-A public sample of the data can be found here:
-
-https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)
-
-== Questions
-
-[IMPORTANT]
-====
-For this project you will need to connect to the database `elections` using the `RMariaDB` package in R. Include the following code chunk in the beginning of your RMarkdown file:
-
-````markdown
-```{r setup-database-connection}`r ''`
-library(RMariaDB)
-con <- dbConnect(RMariaDB::MariaDB(),
- host="scholar-db.rcac.purdue.edu",
- db="elections",
- user="elections_user",
- password="Dataelect!98")
-```
-````
-====
-
-When a question involves SQL queries in this project, you may use a SQL code chunk (with `{sql}`), or an R code chunk (with `{r}`) and functions like `dbGetQuery` as you did in Project 10. Please refer to Question 5 in the xref:templates.adoc[project template] for examples.
-
-=== Question 1
-
-Approximately how large was the lahman database (use the sqlite database in Scholar: `/class/datamine/data/lahman/lahman.db`)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB).
-
-The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder `/class/datamine/data/election/`. How much data (in MB) is there?
-
-The data in that folder has been added to the `elections` database, all aggregated in the `elections` table. Write a SQL query that returns the number of rows of data in the `elections` table. How many rows of data are in the table `elections`?
-
-[NOTE]
-====
-These are some examples of how to get the sizes of collections of files in UNIX:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-The SQL query will take some time! Be patient.
-====
-
-[NOTE]
-====
-You may use more than one code chunk in your RMarkdown file for the different tasks.
-====
-
-[NOTE]
-====
-We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from `ls` and `du` to match, use the `--apparent-size` option with `du`.
-====
-
-[NOTE]
-====
-A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See https://en.wikipedia.org/wiki/Gigabyte[here] for more information. For this question, either solution will be given full credit. https://thedatamine.github.io/the-examples-book/unix.html#why-is-the-result-of-du--b-.metadata.csv-divided-by-1024-not-the-result-of-du--k-.metadata.csv[This] is a potentially useful example.
-====
-
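-The following is a rough sketch of the kinds of one-liners that can work here (the conversion treats 1 MB as 1000^2 bytes, and flags may differ slightly across `du` versions):
-
-[source,bash]
-----
-# apparent size of the single .db file, in MB
-ls -l /class/datamine/data/lahman/lahman.db | awk '{print $5/1000000 " MB"}'
-# apparent size of everything in the election folder, in MB
-du -sb /class/datamine/data/election/ | awk '{print $1/1000000 " MB"}'
-----
-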
-.Items to submit
-====
-- Line of code (bash/awk) to show the size (in MB) of the lahman database file.
-- Approximate size of the lahman database in MB.
-- Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in `/class/datamine/data/election`.
-- The size of the elections data in MB.
-- SQL query used to find the number of rows of data in the `elections` table in the `elections` database.
-- The number of rows in the `elections` table in the `elections` database.
-====
-
-=== Question 2
-
-Write a SQL query using the `LIKE` command to find a unique list of `zip_code` values that start with "479".
-
-Write another SQL query and answer: how many unique `zip_code` values are there that begin with "479"?
-
-[NOTE]
-====
-Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.
-====
-
-++++
-
-++++
-
-[TIP]
-====
-The first query returns a list of zip codes, and the second returns a count.
-====
-
-[TIP]
-====
-Make sure you only select `zip_code`.
-====
-
-.Items to submit
-====
-- SQL queries used to answer the question.
-- The first 5 results from running the query.
-====
-
-=== Question 3
-
-Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an _alias_ for our field so it doesn't read `COUNT(*)` but rather `Indiana Donations`.
-
-[TIP]
-====
-You may enclose an alias's name in quotation marks (single or double) when the name contains a space.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- The result of the SQL query.
-====
-
-=== Question 4
-
-Rewrite the query in (3) so the result is displayed like: `IN: 1234567`. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to `OH: 1234567`. In other words, the state abbreviation should be dynamic, not static.
-
-[NOTE]
-====
-This video demonstrates how to use CONCAT in a MySQL query:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-Use CONCAT and aliasing to accomplish this.
-====
-
-[TIP]
-====
-Remember, `state` contains the state abbreviation.
-====
-
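-A hedged sketch of the shape such a query can take (the alias name is arbitrary, and grouping on `state` is one way to keep the display dynamic):
-
-[source,SQL]
-----
-SELECT CONCAT(state, ': ', COUNT(*)) AS Donations FROM elections WHERE state = 'IN' GROUP BY state;
-----
-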
-.Items to submit
-====
-- SQL query used to answer the question.
-====
-
-=== Question 5
-
-In (2) we wrote a query that returns a unique list of zip codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip codes starting with "479". What percent of donations in Indiana come from said zip codes?
-
-[NOTE]
-====
-This video gives two examples of sub-queries:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-You can simply calculate the percentage manually, using the counts from (3) and (5).
-====
-
-.Items to submit
-====
-- SQL queries used to answer the question.
-- The percentage of donations from Indiana from `zip_code`s starting with "479".
-====
-
-=== Question 6
-
-In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, the number of donations from each state, and the sum of the donations (`transaction_amt`). Which 5 states gave the most donations (highest count)? Order your result from most to least.
-
-[NOTE]
-====
-In this video we demonstrate `GROUP BY`, `ORDER BY`, `DESC`, and other aspects of MySQL that might help with this question:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-You may want to create an alias in order to sort.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- Which 5 states gave the most donations?
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project12.adoc
deleted file mode 100644
index 144c1e421..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project12.adoc
+++ /dev/null
@@ -1,172 +0,0 @@
-= STAT 29000: Project 12 -- Fall 2020
-
-**Motivation:** Databases are composed of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore, learn about, and practice using joins on a database containing bike trip information from the Bay Area Bike Share.
-
-**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in a systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
-
-**Scope:** SQL, sqlite, joins
-
-.Learning objectives
-****
-- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
-- Showcase the ability to filter, alias, and write subqueries.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db`
-
-A public sample of the data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/bay_area_bike_share/bay_area_bike_share.db[here].
-
-[IMPORTANT]
-====
-For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:
-
-````markdown
-```{r, include=F}`r ''`
-library(RSQLite)
-bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")
-```
-````
-
-Each solution should then be placed in a code chunk like this:
-
-````markdown
-```{sql, connection=bikeshare}`r ''`
-SELECT * FROM station LIMIT 5;
-```
-````
-====
-
-If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
-
-```{bash, eval=F}
-sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
-```
-
-From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this:
-
-````markdown
-```{sql, connection=bikeshare, eval=F}`r ''`
-SELECT * FROM station LIMIT 5;
-```
-````
-
-This will allow the code to be displayed without throwing an error.
-
-There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform *all* of the other types of joins (RIGHT JOIN, FULL OUTER JOIN).
-
-== Questions
-
-=== Question 1
-
-Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the `sqlite` concatenation operator `||` (see https://www.sqlitetutorial.net/sqlite-string-functions/sqlite-concat/[here]). Write a query that returns the first 5 records of information from the `station` table formatted in the following way:
-
-`(id) name @ (lat, long)`
-
-For example:
-
-`(84) Ryland Park @ (37.342725, -121.895617)`
-
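-A sketch of how the pieces can fit together; the column names `lat` and `long` are assumptions, so verify them with a quick `SELECT *` first:
-
-[source,SQL]
-----
-SELECT '(' || id || ') ' || name || ' @ (' || lat || ', ' || long || ')' FROM station LIMIT 5;
-----
-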
-[TIP]
-====
-Here is a video about how to concatenate strings in SQLite.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-- The first 5 records of information from the `station` table.
-====
-
-=== Question 2
-
-There is a variety of interesting weather information in the `weather` table. Write a query that finds the average `mean_temperature_f` by `zip_code`. Which is on average the warmest `zip_code`?
-
-Use aliases to format the result in the following way:
-
-```{txt}
-Zip Code|Avg Temperature
-94041|61.3808219178082
-```
-Note that this is the output if you use `sqlite` in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.
-
-[TIP]
-====
-Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-- The results of the query copy and pasted.
-====
-
-=== Question 3
-
-From (2) we can see that there are only 5 `zip_code`s with weather information. How many unique `zip_code`s do we have in the `trip` table? Write a query that finds the number of unique `zip_code`s in the `trip` table. Write another query that lists the `zip_code` and count of the number of times the `zip_code` appears. If we had originally assumed that the `zip_code` was related to the location of the trip itself, we were wrong. Can you think of a likely explanation for the unexpected `zip_code` values in the `trip` table?
-
-[TIP]
-====
-There could be missing values in `zip_code`. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL https://www.w3schools.com/sql/sql_null_values.asp[here].
-====
-
-.Items to submit
-====
-- SQL queries used to solve this problem.
-- 1-2 sentences explaining a possible reason for the unexpected `zip_code` values.
-====
-
-=== Question 4
-
-In (2) we wrote a query that finds the average `mean_temperature_f` by `zip_code`. What if we want to tack on our results in (2) to information from each row in the `station` table based on the `zip_code`? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.
-
-[TIP]
-====
-Use the query from (2) as a subquery within your solution.
-====
-
-[TIP]
-====
-Here is a video about JOIN and LEFT JOIN.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-====
-
-=== Question 5
-
-In (3) we alluded to the fact that many `zip_code` values in the `trip` table aren't very consistent. Users can enter a zip code when using the app. This means that `zip_code` can be from anywhere in the world! With that being said, if the `zip_code` is one of the 5 `zip_code`s for which we have weather data (from question 2), we can add that weather information to matching rows of the `trip` table. In (4) we used an INNER JOIN to append some weather information to each row in the `station` table. For this question, write a query that performs an INNER JOIN and appends weather data from the `weather` table to the trip data from the `trip` table. Limit your output to 5 lines.
-
-[IMPORTANT]
-====
-Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the `trip` table, you can use `start_date` for the date information.
-====
-
-[TIP]
-====
-You will want to wrap your dates and datetimes in https://www.sqlitetutorial.net/sqlite-date-functions/sqlite-date-function/[sqlite's `date` function] prior to comparison.
-====
-
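-A hedged sketch of the join condition; the name of the `weather` table's date column is an assumption to check against the schema:
-
-[source,SQL]
-----
-SELECT * FROM trip AS t INNER JOIN weather AS w
-ON t.zip_code = w.zip_code AND date(t.start_date) = date(w.date) LIMIT 5;
-----
-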
-.Items to submit
-====
-- SQL query used to solve this problem.
-- First 5 lines of output.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project13.adoc
deleted file mode 100644
index 3ceb4cb04..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project13.adoc
+++ /dev/null
@@ -1,155 +0,0 @@
-= STAT 29000: Project 13 -- Fall 2020
-
-**Motivation:** Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like _primary key_, and _foreign key_ will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!
-
-**Context:** We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins.
-
-**Scope:** SQL, sqlite
-
-.Learning objectives
-****
-- Write and run SQL queries in `sqlite` on real-world data.
-- Identify primary and foreign keys in a SQL table.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/movies_and_tv/imdb.db`
-
-[IMPORTANT]
-====
-For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:
-
-````markdown
-```{r, include=F}`r ''`
-library(RSQLite)
-imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
-```
-````
-====
-
-If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
-
-```{bash, eval=F}
-sqlite3 /class/datamine/data/movies_and_tv/imdb.db
-```
-
-From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this:
-
-````markdown
-```{sql, connection=imdb, eval=F}`r ''`
-SELECT * FROM titles LIMIT 5;
-```
-````
-
-This will allow the code to be displayed without throwing an error.
-
-== Questions
-
-=== Question 1
-
-A primary key is a field in a table which uniquely identifies a row in the table. Primary keys _must_ be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the `titles` table. Do you think there are any primary keys? How about foreign keys? Now examine the `episodes` table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?
-
-[TIP]
-====
-A primary key can also be a foreign key.
-====
-
-[TIP]
-====
-Here are two videos. The first video will remind you how to find the names of all of the tables in the `imdb` database. The second video will introduce you to the `titles` and `episodes` tables in the `imdb` database.
-====
-
-++++
-
-++++
-
-++++
-
-++++
-
-.Items to submit
-====
-- List any primary or foreign keys in the `titles` table.
-- List any primary or foreign keys in the `episodes` table.
-====
-
-=== Question 2
-
-If you paste a `title_id` to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series _Grey's Anatomy_. Write a SQL query to confirm that the `title_id` tt0413573 does indeed belong to _Grey's Anatomy_. Then browse imdb.com and find your favorite TV show. Get the `title_id` from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
-
-[source,SQL]
-----
-SELECT * FROM titles WHERE title_id='';
-----
-
-Make sure to replace "" with the `title_id` of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.
-
-.Items to submit
-====
-- SQL query used to confirm that `title_id` tt0413573 does indeed belong to _Grey's Anatomy_.
-- The output of the query.
-- The `title_id` of your favorite TV show.
-- SQL query used to confirm the `title_id` for your favorite TV show.
-- The output of the query.
-====
-
-=== Question 3
-
-The `episode_title_id` column in the `episodes` table references titles of individual episodes of a TV series. The `show_title_id` references the titles of the show itself. With that in mind, write a query that gets a list of all `episode_title_id` values (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode of _Grey's Anatomy_.
-
-[TIP]
-====
-This video shows how to extract titles of episodes in the `imdb` database.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve the problem in a code chunk.
-====
-
-=== Question 4
-
-We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of `episode_title_id` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode.
-
-[TIP]
-====
-This part is just like question (3) but this time with your favorite TV show, which you chose in (2).
-====
-
-[TIP]
-====
-This video shows how to use a subquery to `JOIN` a total of three tables in the `imdb` database.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve the problem in a code chunk.
-- The first 5 results from your query.
-====
-
-=== Question 5
-
-Write a query that appends the rating to each episode from (4). To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL query used to solve the problem in a code chunk.
-- The `episode_title_id`, `primary_title`, and `rating` of the top rated episode from your favorite TV series, in question (2).
-- A statement saying whether the highest rated episode is also your favorite episode.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project14.adoc
deleted file mode 100644
index 36c177b92..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project14.adoc
+++ /dev/null
@@ -1,133 +0,0 @@
-= STAT 29000: Project 14 -- Fall 2020
-
-**Motivation:** As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
-
-**Context:** We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting with writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
-
-**Scope:** awk, UNIX utilities, bash scripts, fread
-
-.Learning objectives
-****
-- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
-- Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.
-- Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
-- Use grep to search files effectively.
-- Use cut to section off data from the command line.
-- Use piping to string UNIX commands together.
-- Use awk for data extraction, and preprocessing.
-- Create bash scripts to automate a process or processes.
-****
-
-== Dataset
-
-The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar:
-
-`/anvil/projects/tdm/data/forest/`
-
-To read more about ENTIRE_PLOTSNAP.csv that you will be working with:
-
-https://www.uvm.edu/femc/data/archive/project/federal-forest-inventory-analysis-data-for/dataset/plot-level-data-gathered-through-forest/metadata#fields
-
-== Questions
-
-=== Question 1
-
-Take a look at `ENTIRE_PLOTSNAP.csv`. Write a line of awk code that displays the `STATECD` followed by the number of rows with that `STATECD`.
-
-.Items to submit
-====
-- Code used to solve the problem.
-- Count of the following `STATECD`s: 1, 2, 4, 5, 6
-====
-
-=== Question 2
-
-Unfortunately, there isn't a very accessible list available that shows which state each `STATECD` represents. This is no problem for us, though: the dataset has `LAT` and `LON`! Write some bash that prints just the `STATECD`, `LAT`, and `LON`.
-
-[NOTE]
-====
-There are 92 columns in our dataset: `awk -F, 'NR==1{print NF}' ENTIRE_PLOTSNAP.csv`. To create a list of `STATECD` to state, we only really need `STATECD`, `LAT`, and `LON`. Keeping the other 89 variables would keep our data at 2.6 GB.
-====
-
-.Items to submit
-====
-- Code used to solve the problem.
-- The output of your code piped to `head`.
-====
-
-=== Question 3
-
-`fread` is a "Fast and Friendly File Finagler". It is part of the very popular `data.table` package in R. We will learn more about this package next semester. For now, read the documentation https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/fread[here] and use the `cmd` argument in conjunction with your bash code from (2) to read the data of `STATECD`, `LAT`, and `LON` into a `data.table` in your R environment.
-
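-A minimal sketch, assuming `cut` was used in (2) and that `STATECD`, `LAT`, and `LON` happen to sit at columns 3, 20, and 21 (placeholder positions; substitute whatever your answer to (2) actually uses):
-
-[source,r]
-----
-library(data.table)
-# pass the bash pipeline from (2) straight to fread via the cmd argument
-dat <- fread(cmd = "cut -d, -f3,20,21 /anvil/projects/tdm/data/forest/ENTIRE_PLOTSNAP.csv")
-head(dat)
-----
-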
-.Items to submit
-====
-- Code used to solve the problem.
-- The `head` of the resulting `data.table`.
-====
-
-=== Question 4
-
-We are going to further understand the data from question (3) by finding the actual locations based on the `LAT` and `LON` columns. We can use the library `revgeo` to get a location given a pair of longitude and latitude values. `revgeo` uses a free API hosted by https://github.com/komoot/photon[photon] in order to do so.
-
-For example:
-
-[source,r]
-----
-library(revgeo)
-revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
-----
-
-The code above will give you the address information in six columns, from the most-granular `housenumber` to the least-granular `country`. Depending on the coordinates, `revgeo` may or may not give you results for each column. For this question, we are going to keep only the `state` column.
-
-There are over 4 million rows in our dataset -- we do _not_ want to hit https://github.com/komoot/photon[photon's] API that many times. Instead, we are going to do the following:
-
-* Unless you feel comfortable using `data.table`, convert your `data.table` to a `data.frame`:
-
-[source,r]
-----
-my_dataframe <- data.frame(my_datatable)
-----
-
-* Calculate the average `LAT` and `LON` for each `STATECD`, and call the new `data.frame`, `dat`. This should result in 57 rows of lat/long pairs.
-
-* For each row in `dat`, run a reverse geocode and append the `state` to a new column called `STATE`.
-
-[TIP]
-====
-To calculate the average `LAT` and `LON` for each `STATECD`, you could use the https://www.rdocumentation.org/packages/sqldf/versions/0.4-11[`sqldf`] package to run SQL queries on your `data.frame`.
-====
-
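-For example, a sketch of the `sqldf` route, assuming your `data.frame` from the previous step is called `my_dataframe`:
-
-[source,r]
-----
-library(sqldf)
-# average latitude/longitude for each STATECD
-dat <- sqldf("SELECT STATECD, AVG(LAT) AS LAT, AVG(LON) AS LON FROM my_dataframe GROUP BY STATECD")
-----
-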
-[TIP]
-====
-https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family[`mapply`] is a useful apply function to use to solve this problem.
-====
-
-[TIP]
-====
-Here is some extra help:
-
-[source,r]
-----
-library(revgeo)
-points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))
-# Note that the "output" argument gets passed to the "revgeo" function.
-mapply(revgeo, points$longitude, points$latitude, output="frame")
-# The output isn't in a great format, and we'd prefer to just get the "state" data.
-# Let's wrap "revgeo" into another function that just gets "state" and try again.
-get_state <- function(lon, lat) {
- return(revgeo(lon, lat, output="frame")["state"])
-}
-mapply(get_state, points$longitude, points$latitude)
-----
-====
-
-[IMPORTANT]
-====
-It is okay to get "Not Found" for some of the addresses.
-====
-
-.Items to submit
-====
-- Code used to solve the problem.
-- The `head` of the resulting `data.frame`.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project15.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project15.adoc
deleted file mode 100644
index f0c2eb117..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project15.adoc
+++ /dev/null
@@ -1,102 +0,0 @@
-= STAT 29000: Project 15 -- Fall 2020
-
-**Motivation:** We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems.
-
-**Context:** In this project, we will reinforce topics you've already learned, with a focus on SQL.
-
-**Scope:** SQL, sqlite, R
-
-.Learning objectives
-****
-- Write and run SQL queries in `sqlite` on real-world data.
-- Use SQL from within R.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/movies_and_tv/imdb.db`
-
-== Questions
-
-=== Question 1
-
-What is the first year where our database has > 1000 titles? Use the `premiered` column in the `titles` table as our year. What year has the most titles?
-
-[TIP]
-====
-There could be missing values in `premiered`. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL https://www.w3schools.com/sql/sql_null_values.asp[here].
-====
-
-.Items to submit
-====
-- SQL queries used to answer the questions.
-- What year is the first year to have > 1000 titles?
-- What year has the most titles?
-====
-
-=== Question 2
-
-What are the unique `type` values in the `titles` table, and how many are there? For the year found in question (1) with the most `titles`, how many titles of each `type` are there?
-
-.Items to submit
-====
-- SQL queries used to answer the questions.
-- How many and what are the unique `type` values from the `titles` table?
-- A list of `type` and count for the year (`premiered`) that had the most `titles`.
-====
-
-F.R.I.E.N.D.S is a popular TV show. They have an interesting naming convention for the names of their episodes: they all begin with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each character's name appears in the titles of the episodes.
-
-=== Question 3
-
-Write a query that gets the `episode_title_id`, `primary_title`, `rating`, and `votes`, of all of the episodes of Friends (`title_id` is tt0108778).
-
-[TIP]
-====
-You can slightly modify the solution to question (5) in project 13.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- First 5 results of the query.
-====
-
-=== Question 4
-
-Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to use regular expressions to search for text. For each character, how many episode `primary_title`s contain their name?
-
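-A hedged sketch of one approach, assuming the data frame from (3) is called `friends` and has a `primary_title` column:
-
-[source,r]
-----
-characters <- c("Chandler", "Joey", "Monica", "Phoebe", "Rachel", "Ross")
-# count how many episode titles contain each name, ignoring case
-sapply(characters, function(nm) sum(grepl(nm, friends$primary_title, ignore.case = TRUE)))
-----
-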
-.Items to submit
-====
-- R code in a code chunk that was used to find the solution.
-- The solution pasted below the code chunk.
-====
-
-=== Question 5
-
-Create a graphic showing our results in (4) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
-
-.Items to submit
-====
-- The R code used to generate the graphic.
-- The graphic in a png or jpg/jpeg format.
-====
-
-=== Question 6
-
-Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see `type` column from `titles` table): Romance, Comedy, Animation. In the `titles` table, you can find the genres in the `genres` column. There may be some overlap (i.e. a movie may have more than one genre); this is OK.
-
-To query rows which have the genre Action as one of its genres:
-
-[source,SQL]
-----
-SELECT * FROM titles WHERE genres LIKE '%action%';
-----
-
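-One hedged way to structure the SQL side, assuming a `ratings` table keyed by `title_id` (check the table and column names before relying on this):
-
-[source,SQL]
-----
-SELECT AVG(r.rating) FROM titles AS t
-INNER JOIN ratings AS r ON t.title_id = r.title_id
-WHERE t.type = 'movie' AND t.genres LIKE '%Romance%';
-----
-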
-.Items to submit
-====
-- Any code you used to solve the problem in a code chunk.
-- The average rating of each of the genres listed for movies.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project01.adoc
deleted file mode 100644
index fd0ebf83a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project01.adoc
+++ /dev/null
@@ -1,170 +0,0 @@
-= STAT 39000: Project 1 -- Fall 2020
-
-**Motivation:** In this project we will jump right into an R review. In this project we are going to break one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.
-
-**Context:** We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned.
-
-**Scope:** data wrangling in R, functions
-
-.Learning objectives
-****
-- Comprehend what a function is, and the components of a function in R.
-- Read and write basic (csv) data.
-- Utilize apply functions in order to solve a data-driven problem.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-You can find useful examples that walk you through relevant material in The Examples Book:
-
-https://thedatamine.github.io/the-examples-book
-
-It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
-
-[IMPORTANT]
-====
-It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials.
-====
-
-We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.
-
-Remember the very useful documentation shortcut `?`. To use, simply type `?` in the console, followed by the name of the function you are interested in.
-
-You can also look for package documentation by using `help(package=PACKAGENAME)`, so for example, to see the documentation for the package `ggplot2`, we could run:
-
-[source,r]
-----
-help(package=ggplot2)
-----
-
-Sometimes it can be helpful to see the source code of a defined function. A https://www.tutorialspoint.com/r/r_functions.htm[function] is any chunk of organized code that is used to perform an operation. Source code is the underlying `R` or `C` or `C++` code that is used to create the function. To see the source code of a defined function, type the function's name without the `()`. For example, if we were curious about what the function `Reduce` does, we could run:
-
-[source,r]
-----
-Reduce
-----
-
-Occasionally this will be less useful, as the resulting code will be code that calls `C` code we can't see. Other times it will allow you to understand the function better.
-
-== Dataset:
-
-`/class/datamine/data/airbnb`
-
-Oftentimes (maybe even the majority of the time), data doesn't come in one nice file or database. Explore the datasets in `/class/datamine/data/airbnb`.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-=== Question 1
-
-You may have noted that, for each country, city, and date we can find 3 files: `calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz` (for now, we will ignore all files in the "visualisations" folders).
-
-Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (`calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz`). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
-
-[TIP]
-====
-`read.csv` has an argument to select the number of rows we want to read.
-====
-
-[TIP]
-====
-Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.
-====
-
-To read a compressed csv, simply use the `read.csv` function:
-
-[source,r]
-----
-dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
-head(dat)
-----
-
-Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the `listings.csv.gz` datasets.
-
-.Items to submit
-====
-- Chunk of code used to read the first 50 rows of each dataset.
-- 1-2 sentences briefly describing the information contained in each dataset.
-- Name(s) of variable(s) that could be used to join them.
-====
-
-=== Question 2
-
-Write a function called `get_paths_for_country`, that, given a string with the country name, returns a vector with the full paths for all `listings.csv.gz` files, starting with `/class/datamine/data/airbnb/...`.
-
-For example, the output from `get_paths_for_country("united-states")` should have 28 entries. Here are the first 5 entries in the output:
-
-----
- [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
- [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
- [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
- [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
- [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
-----
-
-[TIP]
-====
-`list.files` is useful with the `recursive=T` option.
-====
-
-[TIP]
-====
-Use `grep` to search for the pattern `listings.csv.gz` (within the results from the first hint), and use the option `value=T` to display the values found by the `grep` function.
-====
-
-.Items to submit
-====
-- Chunk of code for your `get_paths_for_country` function.
-====
-
-=== Question 3
-
-Write a function called `get_data_for_country` that, given a string with the country name, returns a data.frame containing all of the listings data for that country. Use your previously written function to help you.
-
-[TIP]
-====
-Use `stringsAsFactors=F` in the `read.csv` function.
-====
-
-[TIP]
-====
-Use `do.call(rbind, )` to combine a list of dataframes into a single dataframe.
-====
-
-.Items to submit
-====
-- Chunk of code for your `get_data_for_country` function.
-====
-
-=== Question 4
-
-Use your `get_data_for_country` to get the data for a country of your choice, and make sure to name the data.frame `listings`. Take a look at the following columns: `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, and `is_location_exact`. What is the data type for each column? (You can use `class` or `typeof` or `str` to see the data type.)
-
-These columns would make more sense as logical values (TRUE/FALSE/NA).
-
-Write a function called `transform_column` that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (`""`), so we need to be careful when transforming the data. Test your function on the column `host_is_superhost`.
-
-.Items to submit
-====
-- Chunk of code for your `transform_column` function.
-- Type of `transform_column(listings$host_is_superhost)`.
-====
-
-=== Question 5
-
-Create a histogram for response rates (`host_response_rate`) for super hosts (where `host_is_superhost` is `TRUE`). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert `host_response_rate` from a character containing "%" signs to a numeric variable.
-
-.Items to submit
-====
-- Chunk of code used to answer the question.
-- Histogram of response rates for super hosts.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project02.adoc
deleted file mode 100644
index 997cc587c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project02.adoc
+++ /dev/null
@@ -1,198 +0,0 @@
-= STAT 39000: Project 2 -- Fall 2020
-
-**Motivation:** The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124["Why Most Published Research Findings Are False"] by John Ioannidis (2005).
-
-**Context:** Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates.
-
-**Scope:** Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible.
-
-.Learning objectives
-****
-- Use Markdown syntax within an Rmarkdown document to achieve various text transformations.
-- Use RMarkdown code chunks to display and/or run snippets of code.
-****
-
-== Questions
-
-++++
-
-++++
-
-=== Question 1
-
-Make the following text (including the asterisks) bold: `This needs to be **very** bold`. Make the following text (including the underscores) italicized: `This needs to be _very_ italicized.`
-
-[IMPORTANT]
-====
-Surround your answer in 4 backticks. This will allow you to display the markdown _without_ having the markdown "take effect". For example:
-
-`````markdown
-````
-Some *marked* **up** text.
-````
-`````
-====
-
-[TIP]
-====
-Be sure to check out the https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[Rmarkdown Cheatsheet] and our section on https://thedatamine.github.io/the-examples-book/r.html#r-rmarkdown[Rmarkdown in the book].
-====
-
-[NOTE]
-====
-Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown!
-====
-
-
-.Items to submit
-====
-- 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text.
-====
-
-=== Question 2
-
-Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another *ordered* list that ranks your academic interests in order of most interested to least interested.
-
-[TIP]
-====
-You can learn what ordered and unordered lists are https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[here].
-====
-
-[NOTE]
-====
-Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here.
-====
-
-.Items to submit
-====
-- Create the lists; this time, don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists.
-====
-
-=== Question 3
-
-Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following:
-
-- A header for this section (your choice of size) that says "About".
-- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn, including at least 1 link.
-
-.Items to submit
-====
-- Create the described profile; don't surround your code in backticks.
-====
-
-=== Question 4
-
-LaTeX is a powerful editing tool where you can create beautifully formatted equations and formulas. Replicate the equation found https://wikimedia.org/api/rest_v1/media/math/render/svg/87c061fe1c7430a5201eef3fa50f9d00eac78810[here] as closely as possible.
-
-[TIP]
-====
-Lookup "latex mid" and "latex frac".
-====
-
-.Items to submit
-====
-- Replicate the equation using LaTeX under the Question 4 header in your template.
-====
-
-=== Question 5
-
-Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum:
-
-- Make the title pronounced.
-- Make all links appear as a word or words, rather than the long-form URL.
-- Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code.
-- Make the calls to the `library` function be evaluated but not displayed.
-- Make sure all warnings and errors that may eventually occur, do not appear in the final document.
-
-Feel free to make any other changes that make the report more visually pleasing.
-
-````markdown
-`r ''````{r my-load-packages}
-library(ggplot2)
-```
-
-`r ''````{r declare-variable-390, eval=FALSE}
-my_variable <- c(1,2,3)
-```
-
-All About the Iris Dataset
-
-This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code:
-
-data()
-
-The iris dataset has 5 columns. You can get the names of the columns by running the following code:
-
-names(iris)
-
-Alternatively, you could just run the following code:
-
-iris
-
-The second option provides more detail about the dataset.
-
-According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html there is another dataset built-in to r called `iris3`. This dataset is 3 dimensional instead of 2 dimensional.
-
-An iris is a really pretty flower. You can see a picture of one here:
-
-https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg
-
-In summary. I really like irises, and there is a dataset in r called `iris`.
-````
-
-.Items to submit
-====
-- Make improvements to this section, and place it all under the Question 5 header in your template.
-====
-
-=== Question 6
-
-Create a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`, and display the plot using a code chunk. Make sure the code used to generate the plot is hidden. Include a descriptive caption for the image. Make sure to use an RMarkdown chunk option to create the caption.
-
-.Items to submit
-====
-- Code chunk under the Question 6 header that creates and displays a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`.
-====
-
-=== Question 7
-
-Insert the following code chunk under the Question 7 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem?
-
-````markdown
-```{r my-load-packages}`r ''`
-plot(my_variable)
-```
-````
-
-[TIP]
-====
-Take a close look at the name we give our code chunk.
-====
-
-[TIP]
-====
-Take a look at the code chunk where `my_variable` is declared.
-====
-
-.Items to submit
-====
-- The modified version of the inserted code that fixes both problems.
-- A sentence explaining what the first problem was.
-- A sentence explaining what the second problem was.
-====
-
-=== For Project 2, please submit your .Rmd file and the resulting .pdf file. (For this project, you do not need to submit a .R file.)
-
-=== OPTIONAL QUESTION
-
-RMarkdown is also an excellent tool to create a slide deck. Use the information https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf[here] or https://thedatamine.github.io/the-examples-book/r.html#how-do-i-create-a-set-of-slides-using-rmarkdown[here] to convert your solutions into a slide deck rather than the regular PDF. You may experiment with `slidy`, `ioslides`, or `beamer`; however, make your final set of solutions use `beamer`, as its output is a PDF. Make any needed modifications to make the solutions knit into a well-organized slide deck (for example, include slide breaks and make sure the contents are shown completely). Modify (2) so the bullets are incrementally presented as the slides progress.
-
-[IMPORTANT]
-====
-You do _not_ need to submit the original PDF for this project, just the `beamer` slide version of the PDF.
-====
-
-.Items to submit
-====
-- The modified version of the solutions in `beamer` slide form.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project03.adoc
deleted file mode 100644
index 6ae98fe85..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project03.adoc
+++ /dev/null
@@ -1,214 +0,0 @@
-= STAT 39000: Project 3 -- Fall 2020
-
-**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful `bash` tools, help you navigate a filesystem, and even run `bash` tools from within an RMarkdown file in RStudio.
-
-**Context:** At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.
-
-**Scope:** bash, RStudio
-
-.Learning objectives
-****
-- Distinguish differences in /home, /scratch, and /class.
-- Navigate UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
-- Analyze files in a UNIX filesystem: wc, du, cat, head, tail, etc.
-- Create and destroy files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
-- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
-- Use `man` to read and learn about UNIX utilities.
-- Run `bash` commands from within an RMarkdown file in RStudio.
-****
-
-There are a variety of ways to connect to Scholar. In this class, we will _primarily_ connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.
-
-Here is a video to remind you about some of the basic tools you can use in UNIX/Linux:
-
-++++
-
-++++
-
-This is the easiest book for learning this stuff; it is short and gets right to the point:
-
-https://learning.oreilly.com/library/view/learning-the-unix/0596002610
-
-You just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through).
-
-It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.
-
-== Questions
-
-=== Question 1
-
-Navigate to https://rstudio.scholar.rcac.purdue.edu/ and login. Take some time to click around and explore this tool. We will be writing and running Python, R, SQL, and `bash` all from within this interface. Navigate to `Tools > Global Options ...`. Explore this interface and make at least 2 modifications. List what you changed.
-
-Here are some changes Kevin likes:
-
-- Uncheck "Restore .Rdata into workspace at startup".
-- Change tab width to 4.
-- Check "Soft-wrap R source files".
-- Check "Highlight selected line".
-- Check "Strip trailing horizontal whitespace when saving".
-- Uncheck "Show margin".
-
-(Dr. Ward does not like to customize his own environment, but he does use the emacs key bindings: Tools > Global Options > Code > Keybindings, but this is only recommended if you already know emacs.)
-
-.Items to submit
-====
-- List of modifications you made to your Global Options.
-====
-
-=== Question 2
-
-There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a `bash` shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
-
-What is the default directory of your bash shell?
-
-[TIP]
-====
-Start by reading the section on `man`. `man` stands for manual, and you can find the "official" documentation for a command by typing `man` followed by the name of the command. For example:
-====
-
-[source,bash]
-----
-# read the manual for the `man` command
-# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
-man man
-----
-
-.Items to submit
-====
-- The full filepath of your default directory (home directory). Ex: Kevin's is: `/home/kamstut`
-- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched.
-====
-
-=== Question 3
-
-Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order:
-
-- Write a single command to navigate to the folder containing our full datasets: `/class/datamine/data`.
-- Write a command to confirm you are in the correct folder.
-- Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories?
-- Write another command to return back to your home directory.
-- Write a command to confirm you are in the correct folder.
-
-Note: `/` is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/kamstut` is the full filepath of Kevin's home directory. There is a folder `home` inside the root directory. Inside `home` is another folder named `kamstut` which is Kevin's home directory.
-
-.Items to submit
-====
-- Command used to navigate to the data directory.
-- Command used to confirm you are in the data directory.
-- Command used to list files and folders.
-- List of files and folders in the data directory.
-- Command used to navigate back to the home directory.
-- Command used to confirm you are in the home directory.
-====
-
-=== Question 4
-
-Let's learn about two more important concepts. `.` refers to the current working directory, or the directory displayed when you run `pwd`. Unlike `pwd` you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called `my_file.txt` that lives in `/home/kamstut` (so, a full path of `/home/kamstut/my_file.txt`), and you are currently in `/home/kamstut`, you could run: `cat ./my_file.txt`.
-
-`..` represents the parent folder, or the folder in which your current folder is contained. So let's say you were in `/home/kamstut/projects/` and you wanted to get the contents of the file `/home/kamstut/my_file.txt`. You could do: `cat ../my_file.txt`.
-
-When you navigate a directory tree using `.` and `..` you create paths that are called _relative_ paths because they are _relative_ to your current directory. Alternatively, a _full_ path or (_absolute_ path) is the path starting from the root directory. So `/home/kamstut/my_file.txt` is the _absolute_ path for `my_file.txt` and `../my_file.txt` is a _relative_ path. Perform the following actions, in order:
-
-- Write a single command to navigate to the data directory.
-- Write a single command to navigate back to your home directory using a _relative_ path. Do not use `~` or the `cd` command without a path argument.
-
-.Items to submit
-====
-- Command used to navigate to the data directory.
-- Command used to navigate back to your home directory that uses a _relative_ path.
-====
-
-=== Question 5
-
-In Scholar, when you want to deal with _really_ large amounts of data, you want to access scratch (you can read more https://www.rcac.purdue.edu/policies/scholar/[here]). Your scratch directory on Scholar is located here: `/scratch/scholar/$USER`. `$USER` is an environment variable containing your username. Test it out: `echo /scratch/scholar/$USER`. Perform the following actions:
-
-- Navigate to your scratch directory.
-- Confirm you are in the correct location.
-- Execute `myquota`.
-- Find the location of the `myquota` bash script.
-- Output the first 5 and last 5 lines of the bash script.
-- Count the number of lines in the bash script.
-- How many kilobytes is the script?
-
-[TIP]
-====
-You could use each of the commands in the relevant topics once.
-====
-
-[TIP]
-====
-When you type `myquota` on Scholar, there are sometimes two warnings about `xauth`, but sometimes there are no warnings. If you get a warning that says `Warning: untrusted X11 forwarding setup failed: xauth key data not generated`, it is safe to ignore this warning.
-====
-
-[TIP]
-====
-Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the _options_ of a command in the `DESCRIPTION` section of the `man` pages. For example: `man wc`. You can see `-m`, `-l`, and `-w` are all options for `wc`. To test this out:
-
-[source,bash]
-----
-# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
-wc /class/datamine/data/flights/1987.csv
-# to count the lines, use the -l option
-wc -l /class/datamine/data/flights/1987.csv
-# to count the words, use the -w option
-wc -w /class/datamine/data/flights/1987.csv
-# you can combine options as well
-wc -w -l /class/datamine/data/flights/1987.csv
-# some people like to use a single tack `-`
-wc -wl /class/datamine/data/flights/1987.csv
-# order doesn't matter
-wc -lw /class/datamine/data/flights/1987.csv
-----
-====
-
-[TIP]
-====
-The `-h` option for the `du` command is useful.
-====
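-
-For instance, a minimal sketch of this option (`some_script.sh` is just a hypothetical file name, not part of the course materials):
-
-[source,bash]
-----
-# size reported in 1K blocks (the default)
-du some_script.sh
-# the same file, reported in human-readable units such as K, M, or G
-du -h some_script.sh
-----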
-
-.Items to submit
-====
-- Command used to navigate to your scratch directory.
-- Command used to confirm your location.
-- Output of `myquota`.
-- Command used to find the location of the `myquota` script.
-- Absolute path of the `myquota` script.
-- Command used to output the first 5 lines of the `myquota` script.
-- Command used to output the last 5 lines of the `myquota` script.
-- Command used to find the number of lines in the `myquota` script.
-- Number of lines in the script.
-- Command used to find out how many kilobytes the script is.
-- Number of kilobytes that the script takes up.
-====
-
-=== Question 6
-
-Perform the following operations:
-
-- Navigate to your scratch directory.
-- Copy the file `/class/datamine/data/flights/1987.csv` to your current directory (scratch).
-- Create a new directory called `my_test_dir` in your scratch folder.
-- Move the file you copied to your scratch directory, into your new folder.
-- Use `touch` to create an empty file named `im_empty.txt` in your scratch folder.
-- Remove the directory `my_test_dir` _and_ the contents of the directory.
-- Remove the `im_empty.txt` file.
-
-[TIP]
-====
-`rmdir` may not be able to do what you think; instead, check out the options for `rm` using `man rm`.
-====
-
-.Items to submit
-====
-- Command used to navigate to your scratch directory.
-- Command used to copy the file, `/class/datamine/data/flights/1987.csv` to your current directory (scratch).
-- Command used to create a new directory called `my_test_dir` in your scratch folder.
-- Command used to move the file you copied earlier `1987.csv` into your new `my_test_dir` folder.
-- Command used to create an empty file named `im_empty.txt` in your scratch folder.
-- Command used to remove the directory _and_ the contents of the directory `my_test_dir`.
-- Command used to remove the `im_empty.txt` file.
-====
-
-=== Question 7
-
-Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project04.adoc
deleted file mode 100644
index 468cab60d..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project04.adoc
+++ /dev/null
@@ -1,193 +0,0 @@
-= STAT 39000: Project 4 -- Fall 2020
-
-**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. `grep` is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.
-
-**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python.
-
-**Scope:** grep, regular expression basics, utilizing regular expression tools in R and Python
-
-.Learning objectives
-****
-- Use `grep` to search for patterns within a dataset.
-- Use `cut` to section off and slice up data from the command line.
-- Use `wc` to count the number of lines of input.
-****
-
-You can find useful examples that walk you through relevant material in The Examples Book:
-
-https://the-examples-book.com/book/
-
-It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
-
-[IMPORTANT]
-====
-I would highly recommend using single quotes `'` to surround your regular expressions. Double quotes can have unexpected behavior due to some shells' expansion rules. In addition, pay close attention to escaping certain https://unix.stackexchange.com/questions/20804/in-a-regular-expression-which-characters-need-escaping[characters] in your regular expressions.
-====
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/movies_and_tv/the_office_dialogue.csv`
-
-A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/movies-and-tv/the_office_dialogue.csv[the_office_dialogue.csv]
-
-Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
-
-`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. You can read about and see examples of `grep` https://thedatamine.github.io/the-examples-book/unix.html#grep[here].
-
-== Questions
-
-=== Question 1
-
-Login to Scholar and use `grep` to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". What is the name of the dataset and where is it located?
-
-.Items to submit
-====
-- The `grep` command used to find the dataset.
-- The name and location in Scholar of the dataset.
-- Use `grep` and `grepl` within R to solve a data-driven problem.
-====
-
-=== Question 2
-
-`grep` prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first _n_ lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.
-
-Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.
-
-[TIP]
-====
-The result if you were to search for "bears. beets. battlestar galactica." should be:
-
-----
-"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."
-----
-====
-
-[TIP]
-====
-One method to solve this problem would be to pipe the output from `grep` to `cut`.
-====
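-
-As a rough sketch of that approach (on a hypothetical file `example.csv`, not the course dataset, where the speaker happens to be the 1st comma-separated field and the dialogue the 2nd):
-
-[source,bash]
-----
-# grep finds the matching row(s); cut keeps only the comma-separated fields we care about
-grep -i 'some phrase' example.csv | cut -d, -f1,2
-----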
-
-.Items to submit
-====
-- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
-====
-
-=== Question 3
-
-Find all of the lines where Pam is called "Beesley" instead of "Pam" or "Pam Beesley".
-
-[TIP]
-====
-A negative lookbehind would be one way to solve this; to use a negative lookbehind with `grep`, make sure to add the `-P` option (see the sketch below). In addition, make sure to use single quotes so that your regular expression is taken literally. If you use double quotes, variables are expanded.
-====
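-
-For example, a minimal sketch of a Perl-compatible negative lookbehind (on a hypothetical file `names.txt`, unrelated to the course data):
-
-[source,bash]
-----
-# print lines that contain an occurrence of "Smith" NOT immediately preceded by "John "
-grep -P '(?<!John )Smith' names.txt
-----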
-
-Regular expressions are really a useful semi-language-agnostic tool. What this means is regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using `read.csv`. Name the resulting data.frame `dat`.
-
-.Items to submit
-====
-- The UNIX command used to solve this problem.
-====
-
-=== Question 4
-
-The `text_w_direction` column in `dat` contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" and "]". In this two-part question, we are going to use regular expressions to detect the directions.
-
-(a) Create a new column called `has_direction` that is set to `TRUE` if the `text_w_direction` column has direction, and `FALSE` otherwise. Use the `grepl` function in R to accomplish this.
-
-[TIP]
-====
-Make sure all opening brackets "[" have a corresponding closing bracket "]".
-====
-
-[TIP]
-====
-Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.
-====
-
-(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have at least 2 directions? Modify your code again and find how many have at least 5.
-
-We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.
-
-----
-This is a line with [emphasize this] only 1 direction!
-This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
-----
-
-Your solution to part (a) should match both lines. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.
-
-In our actual dataset, for example, `dat$text_w_direction[2789]` is a line with 2 directions.
-
-.Items to submit
-====
-- The R code and regular expression used to solve the first part of this problem.
-- The R code and regular expression used to solve the second part of this problem.
-- How many lines have >= 2 directions?
-- How many lines have >= 5 directions?
-====
-
-=== Question 5
-
-Use the `str_extract_all` function from the `stringr` package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called `direction`.
-
-----
-This is a line with [emphasize this] only 1 direction!
-This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
-----
-
-In this question, your solution may have extracted:
-
-----
-[emphasize this]
-[emphasize this] 2 sets of direction, do you see the difference [shrug]
-----
-
-It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.
-
-.Items to submit
-====
-- The R code used to solve this problem.
-====
-
-=== OPTIONAL QUESTION
-
-Repeat (5) but this time make sure you only capture the brackets and text within the brackets. Save the results in a new column called `direction_correct`. You can test to see if it is working by running the following code:
-
-```{r, eval=F}
-dat$direction_correct[747]
-```
-
-----
-This is a line with [emphasize this] only 1 direction!
-This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
-----
-
-In (5), your solution may have extracted:
-
-----
-[emphasize this]
-[emphasize this] 2 sets of direction, do you see the difference [shrug]
-----
-
-This is ok for (5). In this question, however, we want to fix this to only extract:
-
-----
-[emphasize this]
-[emphasize this] [shrug]
-----
-
-[TIP]
-====
-This regular expression will be hard to read.
-====
-
-[TIP]
-====
-The pattern we want is: literal opening bracket, followed by 0+ of any character other than the literal [ or literal ], followed by a literal closing bracket.
-====
-
-.Items to submit
-====
-- The R code used to solve this problem.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project05.adoc
deleted file mode 100644
index e53794127..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project05.adoc
+++ /dev/null
@@ -1,171 +0,0 @@
-= STAT 39000: Project 5 -- Fall 2020
-
-**Motivation:** Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.
-
-**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
-
-**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping
-
-.Learning objectives
-****
-- Use `cut` to section off and slice up data from the command line.
-- Use piping to string UNIX commands together.
-- Use `sort` and its options to sort data in different ways.
-- Use `head` to isolate _n_ lines of output.
-- Use `wc` to summarize the number of lines in a file or in output.
-- Use `uniq` to filter out non-unique lines.
-- Use `grep` to search files effectively.
-****
-
-You can find useful examples that walk you through relevant material in The Examples Book:
-
-https://the-examples-book.com/book/
-
-It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
-
-Don't forget the very useful documentation shortcut `?` for R code. To use, simply type `?` in the console, followed by the name of the function you are interested in. In the Terminal, you can use the `man` command to check the documentation of `bash` code.
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/amazon/amazon_fine_food_reviews.csv`
-
-A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/amazon/amazon_fine_food_reviews.csv[amazon_fine_food_reviews.csv]
-
-Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
-
-Here are three videos that might also be useful, as you work on Project 5:
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-== Questions
-
-=== Question 1
-
-What is the `Id` of the most helpful review, according to the highest `HelpfulnessNumerator`?
-
-[IMPORTANT]
-====
-You can always pipe output to `head` in case you want the first few values of a lot of output. Note that if you used `sort` before `head`, you may see the following error messages:
-
-----
-sort: write failed: standard output: Broken pipe
-sort: write error
-----
-
-This is because `head` would truncate the output from `sort`. This is okay. See https://stackoverflow.com/questions/46202653/bash-error-in-sort-sort-write-failed-standard-output-broken-pipe[this discussion] for more details.
-====
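-
-For illustration, a minimal sketch of this sort-then-head pattern (on a hypothetical file `reviews.csv` whose 3rd comma-separated field is numeric):
-
-[source,bash]
-----
-# sort numerically (descending) on field 3, then keep only the top 5 rows;
-# the "Broken pipe" messages from sort, if they appear, can be ignored
-sort -t, -k3,3 -nr reviews.csv | head -n 5
-----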
-
-.Items to submit
-====
-- Line of UNIX commands used to solve the problem.
-- The `Id` of the most helpful review.
-====
-
-=== Question 2
-
-Some entries under the `Summary` column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
-
-To further clarify what we mean by _unique_, if we had the following vector in R, `c("a", "b", "a", "c")`, its unique values are `c("a", "b", "c")`.
-
-.Items to submit
-====
-- Two lines of UNIX commands used to solve the problem.
-- The proportion of unique `Summary` values.
-====
-
-=== Question 3
-
-Use a chain of UNIX commands, piped in a sequence, to create a frequency table of `Score`.
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-- The frequency table.
-====
-
-=== Question 4
-
-Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?
-
-[TIP]
-====
-You may need to pipe the output to `sort` multiple times.
-====
-
-[TIP]
-====
-To create the frequency table, read through the `man` pages for `uniq`. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:
-
-[source,bash]
-----
-man uniq
-----
-====
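-
-Hedged sketch of the general frequency-table pattern (on a hypothetical file `example.csv` whose 2nd comma-separated field is a categorical value; the course data will need different field numbers):
-
-[source,bash]
-----
-# cut out the column, sort it so identical values are adjacent,
-# then let uniq -c count how many times each value occurs, sorted by count
-cut -d, -f2 example.csv | sort | uniq -c | sort -nr
-----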
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-- The frequency table.
-====
-
-=== Question 5
-
-Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative.
-
-[TIP]
-====
-You may find the solution to problem (3) useful.
-====
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-====
-
-=== Question 6
-
-Find the most helpful review with a `Score` of 5. Then (separately) find the most helpful review with a `Score` of 1. As before, we are considering the most helpful review to be the review with the highest `HelpfulnessNumerator`.
-
-[TIP]
-====
-You can use multiple lines to solve this problem.
-====
-
-.Items to submit
-====
-- The lines of UNIX commands used to solve the problem.
-- `ProductId`'s of both requested reviews.
-====
-
-=== Question 7
-
-For *only* the two `ProductId` from the previous question, create a new dataset called `scores.csv` that contains all `ProductId` and `Score` from all reviews for these two items.
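-
-One common pattern for building a small derived file like this is output redirection. A rough sketch is below; `SOME_ID` and `OTHER_ID` are placeholders for the two `ProductId` values, and the field numbers are made up for illustration:
-
-[source,bash]
-----
-# > creates (or overwrites) the output file; >> appends to it
-grep 'SOME_ID' /class/datamine/data/amazon/amazon_fine_food_reviews.csv | cut -d, -f2,7 > scores.csv
-grep 'OTHER_ID' /class/datamine/data/amazon/amazon_fine_food_reviews.csv | cut -d, -f2,7 >> scores.csv
-----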
-
-.Items to submit
-====
-- The line of UNIX commands used to solve the problem.
-====
-
-=== OPTIONAL QUESTION
-
-Use R to load up `scores.csv` into a new data.frame called `dat`. Create a histogram for each product's `Score`. Compare the most helpful review `Score` with those given in the histogram. Based on this comparison, point out some curiosities about the product that may be worth exploring. For example, if a product receives many high scores, but has a super helpful review that gives the product 1 star, I may tend to wonder if the product is not as great as it seems to be.
-
-.Items to submit
-====
-- R code used to create the histograms.
-- 2 histograms, 1 for each `ProductId`.
-- 1-2 sentences describing the curious pattern that you would like to further explore.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project06.adoc
deleted file mode 100644
index 4e5c64910..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project06.adoc
+++ /dev/null
@@ -1,215 +0,0 @@
-= STAT 39000: Project 6 -- Fall 2020
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming; however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, UNIX utilities, bash scripts
-
-.Learning objectives
-****
-- Use `awk` to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-- Use output created from the terminal to create a plot using R.
-****
-
-== Dataset
-
-The following questions will use the dataset found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/YYYY.csv[here] or in Scholar:
-
-`/class/datamine/data/flights/subset/YYYY.csv`
-
-An example from 1987 data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here] or in Scholar:
-
-`/class/datamine/data/flights/subset/1987.csv`
-
-== Questions
-
-=== Question 1
-
-In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the `Origin`, from `1987.csv`. Write another line, this time using `awk` to do the same thing. Which one do you prefer, and why?
-
-Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:
-
-++++
-
-++++
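-
-In the same spirit, here is a minimal sketch comparing the two tools on a hypothetical comma-separated file `people.csv` (not one of the course datasets):
-
-[source,bash]
-----
-# cut: select the 2nd comma-separated field
-cut -d, -f2 people.csv
-# awk: set the field separator with -F and print the 2nd field
-awk -F, '{print $2}' people.csv
-----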
-
-.Items to submit
-====
-- One line of UNIX commands to solve the problem *without* using `awk`.
-- One line of UNIX commands to solve the problem using `awk`.
-- 1-2 sentences describing which method you prefer and why.
-====
-
-=== Question 2
-
-Write a bash script that accepts a year (1987, 1988, etc.) and a column *n* and returns the *nth* column of the associated year of data.
-
-Here are two examples to illustrate how to write a bash script:
-
-++++
-
-++++
-
-++++
-
-++++
-
-For this question, you only need to turn in the content of your bash script (starting with `#!/bin/bash`) in a code chunk, without evaluating it. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:
-
-```{bash, eval=F}
-#!/bin/bash
-echo "First argument: $1"
-echo "Second argument: $2"
-```
-
-If you simply drop that text into a file called `my_script.sh`, located here: `/home/$USER/my_script.sh`, and if you run the following:
-
-```{bash, eval=F}
-# Setup bash to run; this only needs to be run one time per session.
-# It makes bash behave a little more naturally in RStudio.
-exec bash
-# Navigate to the location of my_script.sh
-cd /home/$USER
-# Make sure that the script is runable.
-# This only needs to be done one time for each new script that you write.
-chmod 755 my_script.sh
-# Execute my_script.sh
-./my_script.sh okay cool
-```
-
-then it will print:
-
-----
-First argument: okay
-Second argument: cool
-----
-
-In this example, if we were to turn in the content of our bash script (starting with `#!/bin/bash`) in a code chunk, our solution would look like this:
-
-```{bash, eval=F}
-#!/bin/bash
-echo "First argument: $1"
-echo "Second argument: $2"
-```
-
-And although we aren't running the code chunk above, we know that it works because we tested it in the terminal.
-
-[TIP]
-====
-Using `awk` you could have a script with just two lines: 1 with the "hash-bang" (`#!/bin/bash`), and 1 with a single `awk` command.
-====
-
-.Items to submit
-====
-- The content of your bash script (starting with `#!/bin/bash`) in a code chunk.
-====
-
-=== Question 3
-
-How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using `awk`, then solve this problem using *only* `awk`.
-
-Here is a similar example, using the election data set:
-
-++++
-
-++++
-
-.Items to submit
-====
-- One line of UNIX commands to solve the problem *without* using `awk`.
-- One line of UNIX commands to solve the problem using `awk`.
-- The number of flights that arrived at Indianapolis (IND) in 2008.
-====
-
-=== Question 4
-
-Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (`Origin`, `Dest`)?
-
-Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:
-
-++++
-
-++++
-
-.Items to submit
-====
-- 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
-- The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
-- The number of unique values per category (`Origin`, `Dest`).
-====
-
-=== Question 5
-
-In (4) we found that the number of unique `Origin` values is not the same as the number of unique `Dest` values. Find the https://en.wikipedia.org/wiki/International_Air_Transport_Association_code#Airport_codes[IATA airport codes] for all `Origin` values that don't appear in `Dest` and all `Dest` values that don't appear in `Origin` in the 2008 data.
-
-[TIP]
-====
-The examples on https://www.tutorialspoint.com/unix_commands/comm.htm[this] page should help. Note that these examples are based on https://tldp.org/LDP/abs/html/process-sub.html[Process Substitution], which basically allows you to specify commands whose output is used as the input of `comm`. There should be no space between the `<` and the open parenthesis; otherwise, your bash will not work as intended. A short sketch of this pattern appears below.
-====
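-
-A hedged sketch of the pattern (with hypothetical files `origins.txt` and `dests.txt`, each containing one value per line):
-
-[source,bash]
-----
-# comm expects sorted input; process substitution <( ... ) feeds each sorted list in as if it were a file
-# -23 keeps only lines unique to the first list, -13 keeps only lines unique to the second
-comm -23 <(sort origins.txt) <(sort dests.txt)
-comm -13 <(sort origins.txt) <(sort dests.txt)
-----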
-
-.Items to submit
-====
-- The line(s) of UNIX command(s) used to answer the question.
-- The list of all `Origin` that don't appear in `Dest`.
-- The list of all `Dest` that don't appear in `Origin`.
-====
-
-=== Question 6
-
-What was the percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND"? What percentage of flights had "PHX" as `Origin` (among all flights with `Dest` of "IND")?
-
-Here is an example using the percentages of donations contributed from CEOs from various States:
-
-++++
-
-++++
-
-[TIP]
-====
-You can do the mean calculation in awk by dividing the result from (3) by the number of unique `Origin` that have a `Dest` of "IND".
-====
-
-.Items to submit
-====
-- The percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND".
-- 1-2 sentences explaining how "PHX" compares (as a unique `Origin`) to the other `Origin` values (all with the `Dest` of "IND").
-====
-
-=== Question 7
-
-Write a bash script that takes a year and an IATA airport code and returns the year and the total number of flights to and from the given airport. Example rows may look like:
-
-----
-1987, 12345
-1988, 44
-----
-
-Run the script with inputs: `1991` and `ORD`. Include the output in your submission.
-
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-- The output of the script given `1991` and `ORD` as inputs.
-====
-
-=== OPTIONAL QUESTION 1
-
-Pick your favorite airport and get its IATA airport code. Write a bash script that, given the first year, last year, and airport code, runs the bash script from (7) for all years in the provided range for your given airport, or loops through all of the files for the given airport, appending all of the data to a new file called `my_airport.csv`.
-
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-====
-
-=== OPTIONAL QUESTION 2
-
-In R, load `my_airport.csv` and create a line plot showing the year-by-year change. Label your x-axis "Year", your y-axis "Num Flights", and your title the name of the IATA airport code. Write 1-2 sentences with your observations.
-
-.Items to submit
-====
-- Line chart showing year-by-year change in flights into and out of the chosen airport.
-- R code used to create the chart.
-- 1-2 sentences with your observations.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project07.adoc
deleted file mode 100644
index 1d0623492..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project07.adoc
+++ /dev/null
@@ -1,153 +0,0 @@
-= STAT 39000: Project 7 -- Fall 2020
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the second part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming; however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, UNIX utilities, bash scripts
-
-.Learning objectives
-****
-- Use `awk` to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-== Dataset:
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/flights/subset/YYYY.csv`
-
-An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here].
-
-Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-=== Question 1
-
-Write a line of code that prints a list of the unique values in the `DayOfWeek` column. Write a line of code that prints a list of the unique values in the `DayOfMonth` column. Write a line of code that prints a list of the unique values in the `Month` column. Use the `1987.csv` dataset. Are the results what you expected?
-
-.Items to submit
-====
-- 3 lines of code used to get a list of unique values for the chosen columns.
-- 1-2 sentences explaining whether or not the results are what you expected.
-====
-
-=== Question 2
-
-Our files should have 29 columns. For a given file, write a line of code that prints any lines that do *not* have 29 columns. Test it on `1987.csv`. Were there any rows without 29 columns?
-
-[TIP]
-====
-Checking built-in variables for `awk`, we see that `NF` may be useful!
-====
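-
-To see what `NF` holds, here is a small sketch using the 1987 file mentioned above:
-
-[source,bash]
-----
-# NF is the number of fields awk sees on the current line;
-# here we set the field separator to a comma and print the field count of the header line only
-head -n 1 /class/datamine/data/flights/subset/1987.csv | awk -F, '{print NF}'
-----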
-
-.Items to submit
-====
-- Line of code used to solve the problem.
-- 1-2 sentences explaining whether or not there were any rows without 29 columns.
-====
-
-=== Question 3
-
-Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do *not* have 29 columns.
-
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-- The results of running your bash scripts from year 1987 to 2008.
-====
-
-=== Question 4
-
-`awk` is a really good tool to quickly get some data and manipulate it a little bit. The column `Distance` contains the distances of the flights in miles. Use `awk` to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.
-
-Below is some example output:
-
-----
-Miles: 12345
-Kilometers: 19867.35168
-----
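-
-Hedged sketch of the general accumulate-then-report pattern in `awk` (on a hypothetical file `values.csv` whose 3rd comma-separated field is numeric; the scaling factor is just for illustration):
-
-[source,bash]
-----
-# skip the header line (NR > 1), add up field 3, then print the totals in the END block
-awk -F, 'NR > 1 { total += $3 } END { print "Total:", total; print "Scaled:", total * 1.609344 }' values.csv
-----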
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The results of running the code.
-====
-
-=== Question 5
-
-Use `awk` to calculate the sum of the number of `DepDelay` minutes, grouped according to `DayOfWeek`. Use `2007.csv`.
-
-Below is some example output:
-
-```txt
-DayOfWeek: 0
-1: 1234567
-2: 1234567
-3: 1234567
-4: 1234567
-5: 1234567
-6: 1234567
-7: 1234567
-```
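-
-Hedged sketch of grouping with an `awk` associative array (on a hypothetical file `sales.csv` where field 2 is a category and field 4 is a numeric amount; the order of printed keys is not guaranteed):
-
-[source,bash]
-----
-# accumulate a running sum per category, then print each category and its total in the END block
-awk -F, 'NR > 1 { totals[$2] += $4 } END { for (k in totals) print k": "totals[k] }' sales.csv
-----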
-
-[NOTE]
-====
-1 is Monday.
-====
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The output from running the code.
-====
-
-=== Question 6
-
-It wouldn't be fair to compare the total `DepDelay` minutes by `DayOfWeek` as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of `DepDelay` minutes by the number of flights per `DayOfWeek`. Use `2007.csv`.
-
-Below is some example output:
-
-```txt
-DayOfWeek: 0
-1: 1.234567
-2: 1.234567
-3: 1.234567
-4: 1.234567
-5: 1.234567
-6: 1.234567
-7: 1.234567
-```
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The output from running the code.
-====
-
-=== Question 7
-
-Anyone who has flown knows how frustrating it can be waiting for takeoff, or deboarding the aircraft. These roughly translate to `TaxiOut` and `TaxiIn` respectively. If you were to fly into or out of IND what is your expected total taxi time? Use `2007.csv`.
-
-[NOTE]
-====
-Taxi times are in minutes.
-====
-
-.Items to submit
-====
-- The code used to solve the problem.
-- The output from running the code.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project08.adoc
deleted file mode 100644
index 8bbcb1036..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project08.adoc
+++ /dev/null
@@ -1,148 +0,0 @@
-= STAT 39000: Project 8 -- Fall 2020
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming; however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, UNIX utilities, bash scripts
-
-.Learning objectives
-****
-- Use `awk` to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-== Dataset:
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/flights/subset/YYYY.csv`
-
-An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here].
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-=== Question 1
-
-Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than on the rest of the days, on average. We can use `awk` to quickly check whether this appears to be true.
-
-Write a line of `awk` code that prints the _total_ number of flights that occur on weekend days, followed by the _total_ number of flights that occur on the weekdays. Complete this calculation for 2008 using the `2008.csv` file.
-
-Modify your code to instead print the average number of flights that occur on weekend days, followed by the average number of flights that occur on the weekdays.
-
-[TIP]
-====
-You don't need a large if statement to do this; you can use the `~` comparison operator.
-====
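-
-As a sketch of the idea (with a hypothetical field position, pattern, and filename), `~` lets you test a field against a regular expression inline:
-
-[source,bash]
-----
-# count records whose (hypothetical) 2nd field matches the pattern 6 or 7
-awk -F, '$2 ~ /^(6|7)$/ { n++ } END { print n }' myfile.csv
-----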
-
-.Items to submit
-====
-- Lines of `awk` code that solves the problem.
-- The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
-- The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
-====
-
-=== Question 2
-
-We want to see if there is some truth to the whole "snow bird" concept, where people travel to warmer states like Florida and Arizona during the winter. Let's use the tools we've learned to explore this a little bit.
-
-Take a look at `airports.csv`. In particular run the following:
-
-[source,bash]
-----
-head airports.csv
-----
-
-Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within `awk`. This is messy and we would prefer to create a new file called `new_airports.csv` without any quotes. Write a line of code to do this.
-
-[NOTE]
-====
-You may be wondering *why* we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things.
-====
-
-[TIP]
-====
-You could use `gsub` within `awk` to replace '"' with ''. You can find how to use `gsub` https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html[here].
-====
-
-[TIP]
-====
-If you leave out the third (target) argument to `gsub`, the substitution is applied to the entire record, i.e., to every field.
-====
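-
-A generic sketch of this use of `gsub` (the filenames are placeholders) looks like:
-
-[source,bash]
-----
-# remove every double quote from each record and write the result to a new file
-awk '{ gsub(/"/, ""); print }' input.csv > output.csv
-----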
-
-[TIP]
-====
-[source,bash]
-----
-cat new_airports.csv | wc -l # should be 159 without header
-----
-====
-
-.Items to submit
-====
-- Line of `awk` code used to create the new dataset.
-====
-
-=== Question 3
-
-Write a line of commands that creates a new dataset called `az_fl_airports.txt`. `az_fl_airports.txt` should _only_ contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (2), `new_airports.csv`, as a starting point.
-
-How many airports are there? Did you expect this? Use a line of bash code to count this.
-
-Create a new dataset (called `az_fl_flights.txt`) that contains all of the data for flights into or out of Florida and Arizona (using the `2008.csv` file). Use the newly created dataset, `az_fl_airports.txt` to accomplish this.
-
-[TIP]
-====
-https://unix.stackexchange.com/questions/293684/basic-grep-awk-help-extracting-all-lines-containing-a-list-of-terms-from-one-f
-====
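-
-The general pattern from that thread, sketched with placeholder filenames, is to treat one file as a list of fixed strings and search the other file for them:
-
-[source,bash]
-----
-# keep lines of data.csv containing any of the fixed strings listed (one per line) in patterns.txt
-grep -F -f patterns.txt data.csv > matches.csv
-----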
-
-[TIP]
-====
-[source,bash]
-----
-cat az_fl_flights.txt | wc -l # should be 484705
-----
-====
-
-.Items to submit
-====
-- All UNIX commands used to answer the questions.
-- The number of airports.
-- 1-2 sentences explaining whether you expected this number of airports.
-====
-
-=== Question 4
-
-Write a bash script that accepts the start year, end year, and filename containing airport codes (`az_fl_airports.txt`), and outputs the data for flights into or out of any of the airports listed in the provided filename (`az_fl_airports.txt`). The script should output data for flights using _all_ of the years of data in the provided range. Run the bash script to create a new file called `az_fl_flights_total.csv`.
-
-.Items to submit
-====
-- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-- The line of UNIX code you used to execute the script and create the new dataset.
-====
-
-=== Question 5
-
-Use the newly created dataset, `az_fl_flights_total.csv`, from question 4 to calculate the total number of flights into and out of both states by month, and by year, for a total of 3 columns (year, month, flights). Export this information to a new file called `snowbirds.csv`.
-
-Load up your newly created dataset and use either R or Python (or some other tool) to create a graphic that illustrates whether or not we believe the "snowbird effect" affects flights. Include a description of your graph, as well as your (anecdotal) conclusion.
-
-[TIP]
-====
-You can use 1-dimensional arrays to accomplish this if the key is, for example, the combination of the year and month.
-====
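-
-A minimal sketch of a composite key in `awk` (field positions and filename are hypothetical):
-
-[source,bash]
-----
-# count records per year-month pair by concatenating two fields into one key
-awk -F, 'NR > 1 { n[$1 "-" $2]++ } END { for (k in n) print k "," n[k] }' myfile.csv
-----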
-
-.Items to submit
-====
-- The line of `awk` code used to create the new dataset, `snowbirds.csv`.
-- Code used to create the visualization in a code chunk.
-- The generated plot as either a png or jpg/jpeg.
-- 1-2 sentences describing your plot and your conclusion.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project09.adoc
deleted file mode 100644
index 36057d199..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project09.adoc
+++ /dev/null
@@ -1,290 +0,0 @@
-= STAT 39000: Project 9 -- Fall 2020
-
-**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://www.cloudflare.com/[cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is _well_ worth your time!
-
-**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.
-
-**Scope:** SQL, sqlite
-
-.Learning objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/lahman/lahman.db`
-
-This is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-[IMPORTANT]
-====
-For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd
-====
-
-````markdown
-```{r, include=F}`r ''`
-library(RSQLite)
-lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")
-```
-````
-
-Each solution should then be placed in a code chunk like this:
-
-````markdown
-```{sql, connection=lahman}`r ''`
-SELECT * FROM batting LIMIT 1;
-```
-````
-
-If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
-
-[source,bash]
-----
-sqlite3 /class/datamine/data/lahman/lahman.db
-----
-
-From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this:
-
-````markdown
-```{sql, connection=lahman, eval=F}`r ''`
-SELECT * FROM batting LIMIT 1;
-```
-````
-
-This will allow the code to be displayed without throwing an error.
-
-=== Question 1
-
-Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal and access the Lahman database. How many tables are available?
-
-[TIP]
-====
-To connect to the database, do the following:
-====
-
-[source,bash]
-----
-sqlite3 /class/datamine/data/lahman/lahman.db
-----
-
-[TIP]
-====
-https://database.guide/2-ways-to-list-tables-in-sqlite-database/[This] is a good resource.
-====
-
-.Items to submit
-====
-- How many tables are available in the Lahman database?
-- The sqlite3 commands used to figure out how many tables are available.
-====
-
-=== Question 2
-
-Some people like to try to https://www.washingtonpost.com/graphics/2017/sports/how-many-mlb-parks-have-you-visited/[visit all 30 MLB ballparks] in their lifetime. Use SQL commands to get a list of `parks` and the cities they're located in. For your final answer, limit the output to 10 records/rows.
-
-[NOTE]
-====
-There may be more than 30 parks in your result; this is OK. For long results, you can limit the number of printed results using the `LIMIT` clause.
-====
-
-[TIP]
-====
-Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:
-
-[source,SQL]
-----
-.headers on
-----
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 3
-
-There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (`HR`) in a season. List their `playerID`, `yearID`, home run total, and the `teamID` they played for.
-
-[TIP]
-====
-There are 8 occurrences of home runs greater than or equal to 60.
-====
-
-[TIP]
-====
-The `batting` table is where you should look for this question.
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 4
-
-Make a list of players born on your birthday (month and day; don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.
-
-[TIP]
-====
-The `people` table is where you should look for this question.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 5
-
-Get the Cleveland (CLE) Pitching Roster from the 2016 season (`playerID`, `W`, `L`, `SO`). Order the pitchers by number of Strikeouts (SO) in descending order.
-
-[TIP]
-====
-The `pitching` table is where you should look for this question.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 6
-
-Find the 10 team and year pairs with the greatest number of Errors (`E`) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?
-
-[TIP]
-====
-The `teams` table is where you should look for this question.
-====
-
-[TIP]
-====
-The `BETWEEN` clause is useful here.
-====
-
-[TIP]
-====
-It is OK to use multiple queries to answer the question.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 7
-
-Find the `playerID` for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table `pitching`)? What year and team did he win the most games as a manager (use table `managers`)?
-
-[TIP]
-====
-It is OK to use multiple queries to answer the question.
-====
-
-[NOTE]
-====
-There was a tie between the two years in which Bob Lemon had the most wins as a pitcher.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
-
-=== Question 8
-
-For the https://en.wikipedia.org/wiki/American_League_West[AL West] (use `lgID` and `divID` to specify this), find the home run (`HR`), walk (`BB`), and stolen base (`SB`) totals by team between 2000 and 2010. Which team and year combo led in each category in the decade?
-
-[TIP]
-====
-The `teams` table is where you should look for this question.
-====
-
-[TIP]
-====
-It is OK to use multiple queries to answer the question.
-====
-
-[TIP]
-====
-Use `divID == 'W'` as one of the conditions. Note that using double quotes, as in `divID == "W"`, will not work.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The team-year combination that ranked top in each category.
-====
-
-=== Question 9
-
-Get a list of the following by year: wins (`W`), losses (`L`), home runs hit (`HR`), home runs allowed (`HRA`), and total home game attendance (`attendance`) for the Detroit Tigers when winning a World Series (`WSWin` is `Y`) or when winning the league championship (`LgWin` is `Y`).
-
-[TIP]
-====
-The `teams` table is where you should look for this question.
-====
-
-[TIP]
-====
-Be careful with the order of operations for `AND` and `OR`. Remember you can force order of operations using parentheses.
-====
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL code used to solve the problem.
-- The first 10 results of the query.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project10.adoc
deleted file mode 100644
index 6aac2e842..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project10.adoc
+++ /dev/null
@@ -1,200 +0,0 @@
-= STAT 39000: Project 10 -- Fall 2020
-
-**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it _will_ start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.
-
-**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`.
-
-**Scope:** SQL, sqlite
-
-.Learning objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.
-****
-
-== Dataset
-
-The following questions will use a dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at `scholar-db.rcac.purdue.edu`.
-As in Project 9, this is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns.
-
-== Questions
-
-[IMPORTANT]
-====
-Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
-====
-
-[IMPORTANT]
-====
-Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.
-====
-
-[IMPORTANT]
-====
-For this project all solutions should be done using R code chunks, and the `RMariaDB` package. Run the following code to load the library:
-
-[source,r]
-----
-library(RMariaDB)
-----
-====
-
-=== Question 1
-
-Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and, rather than navigating to the terminal as we did in the previous project, create a connection to our MariaDB lahman database using the `RMariaDB` package in R and the credentials below. Confirm the connection by running the following code chunk:
-
-[source,r]
-----
-con <- dbConnect(RMariaDB::MariaDB(),
- host="scholar-db.rcac.purdue.edu",
- db="lahmandb",
- user="lahman_user",
- password="HitAH0merun")
-head(dbGetQuery(con, "SHOW tables;"))
-----
-
-[TIP]
-====
-In the example provided, the variable `con` returned by the `dbConnect` function is the connection. Each query that you make using `dbGetQuery` needs to use this connection `con`. You can change the name `con` if you want to (it is user defined), but if you do, you need to change it in all of your queries. If your connection to the database dies while you are working on the project, you can always re-run the `dbConnect` line to reset your connection to the database.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- Output from running your (potentially modified) `head(dbGetQuery(con, "SHOW tables;"))`.
-====
-
-=== Question 2
-
-How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (`SB`) and hit at least 40 home runs (`HR`) in one year.
-
-[TIP]
-====
-Use the `batting` table.
-====
-
-[IMPORTANT]
-====
-You only need to run `library(RMariaDB)` and the `dbConnect` portion of the code a single time towards the top of your project. After that, you can simply reuse your connection `con` to run queries.
-====
-
-[IMPORTANT]
-====
-In our xref:templates.adoc[project template], for this project, make all of the SQL queries using the `dbGetQuery` function, which returns the results directly in `R`. Therefore, your `RMarkdown` blocks for this project should all be `{r}` blocks (as opposed to the `{sql}` blocks used in Project 9).
-====
-
-[TIP]
-====
-You can use `dbGetQuery` to run your queries from within R. Example:
-
-[source,r]
-----
-dbGetQuery(con, "SELECT * FROM batting LIMIT 5;")
-----
-====
-
-[NOTE]
-====
-We already demonstrated the correct SQL query to use for the 40/40 club in the video below, but now we want you to use `RMariaDB` to solve this query.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 3
-
-How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use `sum` in R rather than `SUM` in SQL.
-====
-
-[TIP]
-====
-Use the `people` table to find the `playerID` and use the `batting` table to find the statistics.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 4
-
-The https://en.wikipedia.org/wiki/Batting_average_(baseball)[Batting Average] is a metric for a batter's performance. The Batting Average in a year is calculated by stem:[\frac{H}{AB}] (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to `playerID`, `teamID`, and `yearID.`
-
-[TIP]
-====
-Use the `batting` table.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 5
-
-How many unique players have hit > 50 home runs (`HR`) in a season?
-
-[TIP]
-====
-Rather than viewing `DISTINCT` as being paired with `SELECT`, think of it as being paired with one of the fields you are selecting.
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 6
-
-Find the number of unique players that attended Purdue University. Start by finding the `schoolID` for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.
-
-[TIP]
-====
-Use the `schools` table to get all `schoolID` and the `collegeplaying` table to get the statistics.
-====
-
-[TIP]
-====
-You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.
-====
-
-[TIP]
-====
-To find IU's id, try the following query: `SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';`. You can find more about the LIKE clause and `%` https://www.tutorialspoint.com/sql/sql-like-clause.htm[here].
-====
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
-
-=== Question 7
-
-Use R, SQL and the lahman database to create an interesting infographic. For those of you who are not baseball fans, try doing a Google image search for "baseball plots" for inspiration. Make sure the plot is polished, has appropriate labels, color, etc.
-
-.Items to submit
-====
-- R code used to solve the problem.
-- The result of running the R code.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project11.adoc
deleted file mode 100644
index 439fe0043..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project11.adoc
+++ /dev/null
@@ -1,227 +0,0 @@
-= STAT 39000: Project 11 -- Fall 2020
-
-**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
-
-**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!
-
-**Scope:** SQL, SQL in R
-
-.Learning objectives
-****
-- Demonstrate the ability to interact with popular database management systems within R.
-- Solve data-driven problems using a combination of SQL and R.
-- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
-- Showcase the ability to filter, alias, and write subqueries.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.
-****
-
-== Dataset
-
-The following questions will use the `elections` database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:
-
-`/class/datamine/data/election/itcontYYYY.txt` (for example, data for year 1980 would be `/class/datamine/data/election/itcont1980.txt`)
-
-A public sample of the data can be found here:
-
-https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)
-
-== Questions
-
-[IMPORTANT]
-====
-For this project you will need to connect to the database `elections` using the `RMariaDB` package in R. Include the following code chunk in the beginning of your RMarkdown file:
-
-````markdown
-```{r setup-database-connection}`r ''`
-library(RMariaDB)
-con <- dbConnect(RMariaDB::MariaDB(),
- host="scholar-db.rcac.purdue.edu",
- db="elections",
- user="elections_user",
- password="Dataelect!98")
-```
-````
-====
-
-When a question involves SQL queries in this project, you may use a SQL code chunk (with `{sql}`), or an R code chunk (with `{r}`) and functions like `dbGetQuery` as you did in Project 10. Please refer to Question 5 in the xref:templates.adoc[project template] for examples.
-
-=== Question 1
-
-Approximately how large was the lahman database (use the sqlite database in Scholar: `/class/datamine/data/lahman/lahman.db`)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB).
-
-The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder `/class/datamine/data/election/`. How much data (in MB) is there?
-
-The data in that folder has been added to the `elections` database, all aggregated in the `elections` table. Write a SQL query that returns the number of rows of data in that table. How many rows of data are in the table `elections`?
-
-[NOTE]
-====
-These are some examples of how to get the sizes of collections of files in UNIX:
-====
-
-++++
-
-++++
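-
-For reference, here are a couple of generic ways to report sizes in MB from the shell; the paths are placeholders:
-
-[source,bash]
-----
-# size of a single file, converted from bytes to MB
-ls -l /path/to/somefile.db | awk '{ print $5 / 1000000 " MB" }'
-# total apparent size of a folder, converted from bytes to MB
-du -sb /path/to/somefolder | awk '{ print $1 / 1000000 " MB" }'
-----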
-
-[TIP]
-====
-The SQL query will take some time! Be patient.
-====
-
-[NOTE]
-====
-You may use more than one code chunk in your RMarkdown file for the different tasks.
-====
-
-[NOTE]
-====
-We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from `ls` and `du` to match, use the `--apparent-size` option with `du`.
-====
-
-[NOTE]
-====
-A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See https://en.wikipedia.org/wiki/Gigabyte[here] for more information. For this question, either solution will be given full credit. https://thedatamine.github.io/the-examples-book/unix.html#why-is-the-result-of-du--b-.metadata.csv-divided-by-1024-not-the-result-of-du--k-.metadata.csv[This] is a potentially useful example.
-====
-
-.Items to submit
-====
-- Line of code (bash/awk) to show the size (in MB) of the lahman database file.
-- Approximate size of the lahman database in MB.
-- Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in `/class/datamine/data/election`.
-- The size of the elections data in MB.
-- SQL query used to find the number of rows of data in the `elections` table in the `elections` database.
-- The number of rows in the `elections` table in the `elections` database.
-====
-
-=== Question 2
-
-Write a SQL query using the `LIKE` command to find a unique list of `zip_code` that start with "479".
-
-Write another SQL query and answer: How many unique `zip_code` are there that begin with "479"?
-
-[NOTE]
-====
-Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.
-====
-
-++++
-
-++++
-
-[TIP]
-====
-The first query returns a list of zip codes, and the second returns a count.
-====
-
-[TIP]
-====
-Make sure you only select `zip_code`.
-====
-
-.Items to submit
-====
-- SQL queries used to answer the question.
-- The first 5 results from running the query.
-====
-
-=== Question 3
-
-Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an _alias_ for our field so it doesn't read `COUNT(*)` but rather `Indiana Donations`.
-
-[TIP]
-====
-You may enclose an alias's name in quotation marks (single or double) when the name contains a space.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- The result of the SQL query.
-====
-
-=== Question 4
-
-Rewrite the query in (3) so the result is displayed like: `IN: 1234567`. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to `OH: 1234567`. In other words, the state abbreviation should be dynamic, not static.
-
-[NOTE]
-====
-This video demonstrates how to use CONCAT in a MySQL query:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-Use CONCAT and aliasing to accomplish this.
-====
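-
-A generic sketch of combining `CONCAT` with an alias; the table and column names are hypothetical, not the ones needed here:
-
-[source,SQL]
-----
--- build a single labeled string from a column value and an aggregate
-SELECT CONCAT(state, ': ', COUNT(*)) AS summary
-FROM some_table
-WHERE state = 'IN'
-GROUP BY state;
-----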
-
-[TIP]
-====
-Remember, `state` contains the state abbreviation.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-====
-
-=== Question 5
-
-In (2) we wrote a query that returns a unique list of zip codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip codes starting with "479". What percent of donations in Indiana come from said zip codes?
-
-[NOTE]
-====
-This video gives two examples of sub-queries:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-You can simply manually calculate the percent using the count in (2) and (5).
-====
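-
-A generic sketch of using one query as a sub-query inside another; all names here are hypothetical:
-
-[source,SQL]
-----
--- count rows whose key appears in the result of the inner query
-SELECT COUNT(*)
-FROM some_table
-WHERE some_code IN (SELECT DISTINCT some_code FROM some_table WHERE some_code LIKE '479%');
-----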
-
-.Items to submit
-====
-- SQL queries used to answer the question.
-- The percentage of donations from Indiana from `zip_code`s starting with "479".
-====
-
-=== Question 6
-
-In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, the number of donations from each state, and the sum of the donations (`transaction_amt`). Which 5 states gave the most donations (highest count)? Order your results from most to least.
-
-[NOTE]
-====
-In this video we demonstrate `GROUP BY`, `ORDER BY`, `DESC`, and other aspects of MySQL that might help with this question:
-====
-
-++++
-
-++++
-
-[TIP]
-====
-You may want to create an alias in order to sort.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- Which 5 states gave the most donations?
-====
-
-=== Question 7
-
-Write a query that gets the number of donations, and sum of donations, by year, for Indiana. Create one or more graphics that highlights the year-by-year changes. Write a short 1-2 sentences explaining your graphic(s).
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- R code used to create your graphic(s).
-- 1 or more graphics in png/jpeg format.
-- 1-2 sentences summarizing your graphic(s).
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project12.adoc
deleted file mode 100644
index 9764724b7..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project12.adoc
+++ /dev/null
@@ -1,186 +0,0 @@
-= STAT 39000: Project 12 -- Fall 2020
-
-**Motivation:** Databases are composed of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share.
-
-**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in a systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
-
-**Scope:** SQL, sqlite, joins
-
-.Learning objectives
-****
-- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
-- Showcase the ability to filter, alias, and write subqueries.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db`
-
-A public sample of the data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/bay_area_bike_share/bay_area_bike_share.db[here].
-
-[IMPORTANT]
-====
-For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:
-
-````markdown
-```{r, include=F}`r ''`
-library(RSQLite)
-bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")
-```
-````
-
-Each solution should then be placed in a code chunk like this:
-
-````markdown
-```{sql, connection=bikeshare}`r ''`
-SELECT * FROM station LIMIT 5;
-```
-````
-====
-
-If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
-
-[source,bash]
-----
-sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
-----
-
-From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this:
-
-````markdown
-```{sql, connection=bikeshare, eval=F}`r ''`
-SELECT * FROM station LIMIT 5;
-```
-````
-
-This will allow the code to be displayed without throwing an error.
-
-There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform *all* of the other types of joins (RIGHT JOIN, FULL OUTER JOIN).
-
-== Questions
-
-=== Question 1
-
-Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the `sqlite` concatenation operator `||`; see https://www.sqlitetutorial.net/sqlite-string-functions/sqlite-concat/[here]. Write a query that returns the first 5 records of information from the `station` table formatted in the following way:
-
-`(id) name @ (lat, long)`
-
-For example:
-
-`(84) Ryland Park @ (37.342725, -121.895617)`
-
-[TIP]
-====
-Here is a video about how to concatenate strings in SQLite.
-====
-
-++++
-
-++++
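-
-A generic sketch of `||` concatenation in SQLite; the column and table names are hypothetical:
-
-[source,SQL]
-----
--- glue literal text and column values into one output string
-SELECT '(' || some_id || ') ' || some_name AS label
-FROM some_table
-LIMIT 5;
-----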
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-- The first 5 records of information from the `station` table.
-====
-
-=== Question 2
-
-There is a variety of interesting weather information in the `weather` table. Write a query that finds the average `mean_temperature_f` by `zip_code`. Which is on average the warmest `zip_code`?
-
-Use aliases to format the result in the following way:
-
-----
-Zip Code|Avg Temperature
-94041|61.3808219178082
-----
-
-Note that this is the output if you use `sqlite` in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.
-
-[TIP]
-====
-Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-- The results of the query copy and pasted.
-====
-
-=== Question 3
-
-From (2) we can see that there are only 5 `zip_code`s with weather information. How many unique `zip_code`s do we have in the `trip` table? Write a query that finds the number of unique `zip_code`s in the `trip` table. Write another query that lists the `zip_code` and count of the number of times the `zip_code` appears. If we had originally assumed that the `zip_code` was related to the location of the trip itself, we were wrong. Can you think of a likely explanation for the unexpected `zip_code` values in the `trip` table?
-
-[TIP]
-====
-There could be missing values in `zip_code`. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL https://www.w3schools.com/sql/sql_null_values.asp[here].
-====
-
-.Items to submit
-====
-- SQL queries used to solve this problem.
-- 1-2 sentences explaining a possible reason for the unexpected `zip_code` values.
-====
-
-=== Question 4
-
-In (2) we wrote a query that finds the average `mean_temperature_f` by `zip_code`. What if we want to tack on our results in (2) to information from each row in the `station` table based on the `zip_code`? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.
-
-[TIP]
-====
-Use the query from (2) as a sub query within your solution.
-====
-
-[TIP]
-====
-Here is a video about JOIN and LEFT JOIN.
-====
-
-++++
-
-++++
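-
-A generic sketch of an INNER JOIN against an aggregate sub-query; all names below are hypothetical:
-
-[source,SQL]
-----
--- join table_a to the grouped averages computed from table_b on a shared key
-SELECT a.some_name, b.avg_value
-FROM table_a AS a
-INNER JOIN (SELECT key_col, AVG(value_col) AS avg_value FROM table_b GROUP BY key_col) AS b
-ON a.key_col = b.key_col;
-----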
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-====
-
-=== Question 5
-
-In (3) we alluded to the fact that many `zip_code` values in the `trip` table aren't very consistent. Users can enter a zip code when using the app. This means that `zip_code` can be from anywhere in the world! With that being said, if the `zip_code` is one of the 5 `zip_code`s for which we have weather data (from question 2), we can add that weather information to matching rows of the `trip` table. In (4) we used an INNER JOIN to append some weather information to each row in the `station` table. For this question, write a query that performs an INNER JOIN and appends weather data from the `weather` table to the trip data from the `trip` table. Limit your output to 5 lines.
-
-[IMPORTANT]
-====
-Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the `trip` table, you can use `start_date` for the date information.
-====
-
-[TIP]
-====
-You will want to wrap your dates and datetimes in https://www.sqlitetutorial.net/sqlite-date-functions/sqlite-date-function/[sqlite's `date` function] prior to comparison.
-====
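-
-A generic sketch of joining on two conditions while normalizing the dates; all names below are hypothetical:
-
-[source,SQL]
-----
--- match rows on a shared key AND on the calendar date of two datetime columns
-SELECT t.*, w.some_weather_col
-FROM table_t AS t
-INNER JOIN table_w AS w
-  ON t.key_col = w.key_col AND date(t.start_datetime) = date(w.obs_date)
-LIMIT 5;
-----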
-
-.Items to submit
-====
-- SQL query used to solve this problem.
-- First 5 lines of output.
-====
-
-=== Question 6
-
-How many rows are in the result from (5) (when not limiting to 5 lines)? How many rows are in the `trip` table? As you can see, a large proportion of the data from the `trip` table did not match the data from the `weather` table, and therefore was removed from the result. What if we want to keep all of the data from the `trip` table and add on data from the `weather` table if we have a match? Write a query to accomplish this. How many rows are in the result?
-
-.Items to submit
-====
-- SQL query used to find how many rows from the result in (5).
-- The number of rows in the result of (5).
-- SQL query to find how many rows are in the `trip` table.
-- The number of rows in the `trip` table.
-- SQL query to keep all of the data from the `trip` table and add on matching data from the `weather` table when available.
-- The number of rows in the result.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project13.adoc
deleted file mode 100644
index 5124e4c19..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project13.adoc
+++ /dev/null
@@ -1,180 +0,0 @@
-= STAT 39000: Project 13 -- Fall 2020
-
-**Motivation:** Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like _primary key_, and _foreign key_ will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!
-
-**Context:** We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins.
-
-**Scope:** SQL, sqlite
-
-.Learning objectives
-****
-- Write and run SQL queries in `sqlite` on real-world data.
-- Identify primary and foreign keys in a SQL table.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/movies_and_tv/imdb.db`
-
-[IMPORTANT]
-====
-For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:
-
-````markdown
-```{r, include=F}`r ''`
-library(RSQLite)
-imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
-```
-````
-====
-
-If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
-
-[source,bash]
-----
-sqlite3 /class/datamine/data/movies_and_tv/imdb.db
-----
-
-From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this:
-
-````markdown
-```{sql, connection=imdb, eval=F}`r ''`
-SELECT * FROM titles LIMIT 5;
-```
-````
-
-This will allow the code to be displayed without throwing an error.
-
-== Questions
-
-=== Question 1
-
-A primary key is a field in a table which uniquely identifies a row in the table. Primary keys _must_ be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have at most one primary key, but it can have any number of foreign keys. Examine the `titles` table. Do you think there are any primary keys? How about foreign keys? Now examine the `episodes` table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?
-
-[TIP]
-====
-A primary key can also be a foreign key.
-====
-
-[TIP]
-====
-Here are two videos. The first video will remind you how to find the names of all of the tables in the `imdb` database. The second video will introduce you to the `titles` and `episodes` tables in the `imdb` database.
-====
-
-++++
-
-++++
-
-++++
-
-++++
-
-.Items to submit
-====
-- List any primary or foreign keys in the `titles` table.
-- List any primary or foreign keys in the `episodes` table.
-====
-
-=== Question 2
-
-If you paste a `title_id` to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series _Grey's Anatomy_. Write a SQL query to confirm that the `title_id` tt0413573 does indeed belong to _Grey's Anatomy_. Then browse imdb.com and find your favorite TV show. Get the `title_id` from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
-
-[source,SQL]
-----
-SELECT * FROM titles WHERE title_id='';
-----
-
-Make sure to replace "" with the `title_id` of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.
-
-.Items to submit
-====
-- SQL query used to confirm that `title_id` tt0413573 does indeed belong to _Grey's Anatomy_.
-- The output of the query.
-- The `title_id` of your favorite TV show.
-- SQL query used to confirm the `title_id` for your favorite TV show.
-- The output of the query.
-====
-
-=== Question 3
-
-The `episode_title_id` column in the `episodes` table references titles of individual episodes of a TV series. The `show_title_id` references the titles of the show itself. With that in mind, write a query that gets a list of all `episode_title_id` values (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode of _Grey's Anatomy_.
-
-[TIP]
-====
-This video shows how to extract titles of episodes in the `imdb` database.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve the problem in a code chunk.
-====
-
-=== Question 4
-
-We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of `episode_title_id` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode.
-
-[TIP]
-====
-This part is just like question (3) but this time with your favorite TV show, which you chose in (2).
-====
-
-[TIP]
-====
-This video shows how to use a subquery, to `JOIN` a total of three tables in the `imdb` database.
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- SQL query used to solve the problem in a code chunk.
-- The first 5 results from your query.
-====
-
-=== Question 5
-
-Write a query that appends the `rating` to each episode from (4). To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?
-
-[NOTE]
-====
-Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here].
-====
-
-.Items to submit
-====
-- SQL query used to solve the problem in a code chunk.
-- The `episode_title_id`, `primary_title`, and `rating` of the top rated episode from your favorite TV series, in question (2).
-- A statement saying whether the highest rated episode is also your favorite episode.
-====
-
-
-=== Question 6
-
-Write a query that returns the `season_number` (from the `episodes` table), and average `rating` (from the `ratings` table) for each season of your favorite TV show from (2). Write another query that only returns the season number and `rating` for the highest rated season. Consider the highest rated season the season with the highest average.
-
-.Items to submit
-====
-- The 2 SQL queries used to solve the problems in two code chunks.
-====
-
-=== Question 7
-
-Write a query that returns the `primary_title` and `rating` of the highest rated episode per season for your favorite TV show from question (2).
-
-[NOTE]
-====
-You can show one highest rated episode for each season, without the need to worry about ties.
-====
-
-.Items to submit
-====
-- The SQL query used to solve the problem.
-- The output from your query.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project14.adoc
deleted file mode 100644
index 7a685957f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project14.adoc
+++ /dev/null
@@ -1,169 +0,0 @@
-= STAT 39000: Project 14 -- Fall 2020
-
-**Motivation:** As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
-
-**Context:** We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
-
-**Scope:** awk, UNIX utilities, bash scripts, fread
-
-.Learning objectives
-****
-- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
-- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.
-- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
-- Use grep to search files effectively.
-- Use cut to section off data from the command line.
-- Use piping to string UNIX commands together.
-- Use awk for data extraction, and preprocessing.
-- Create bash scripts to automate a process or processes.
-****
-
-== Dataset
-
-The following questions will use `ENTIRE_PLOTSNAP.csv` from the following data folder:
-
-`/anvil/projects/tdm/data/forest/`
-
-To read more about the `ENTIRE_PLOTSNAP.csv` file that you will be working with, see:
-
-https://www.uvm.edu/femc/data/archive/project/federal-forest-inventory-analysis-data-for/dataset/plot-level-data-gathered-through-forest/metadata#fields
-
-== Questions
-
-=== Question 1
-
-Take a look at `ENTIRE_PLOTSNAP.csv`. Write a line of `awk` code that displays the `STATECD` followed by the number of rows with that `STATECD`.
-
-.Items to submit
-====
-- Code used to solve the problem.
-- Count of the following `STATECD`s: 1, 2, 4, 5, 6
-====
-
-=== Question 2
-
-Unfortunately, there isn't a very accessible list available that shows which state each `STATECD` represents. This is no problem for us, though; the dataset has `LAT` and `LON`! Write some bash that prints just the `STATECD`, `LAT`, and `LON`.
-
-[NOTE]
-====
-There are 92 columns in our dataset: `awk -F, 'NR==1{print NF}' ENTIRE_PLOTSNAP.csv`. To create a list of `STATECD` to state, we only really need `STATECD`, `LAT`, and `LON`. Keeping the other 89 variables would keep our data at 2.6 GB.
-====
-
-.Items to submit
-====
-- Code used to solve the problem.
-- The output of your code piped to `head`.
-====
-
-=== Question 3
-
-`fread` is a "Fast and Friendly File Finagler". It is part of the very popular `data.table` package in R. We will learn more about this package next semester. For now, read the documentation https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/fread[here] and use the `cmd` argument in conjunction with your bash code from (2) to read the data of `STATECD`, `LAT`, and `LON` into a `data.table` in your R environment.
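-
-As a quick illustration of the `cmd` argument (the shell command and path here are placeholders, not the required pipeline):
-
-[source,r]
-----
-library(data.table)
-# read the output of a shell pipeline directly into a data.table
-dat <- fread(cmd = "cut -d, -f1,2,3 /path/to/somefile.csv")
-head(dat)
-----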
-
-.Items to submit
-====
-- Code used to solve the problem.
-- The `head` of the resulting `data.table`.
-====
-
-=== Question 4
-
-We are going to further understand the data from question (3) by finding the actual locations based on the `LAT` and `LON` columns. We can use the library `revgeo` to get a location given a pair of longitude and latitude values. `revgeo` uses a free API hosted by https://github.com/komoot/photon[photon] in order to do so.
-
-For example:
-
-[source,r]
-----
-library(revgeo)
-revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
-----
-
-The code above will give you the address information in six columns, from the most-granular `housenumber` to the least-granular `country`. Depending on the coordinates, `revgeo` may or may not give you results for each column. For this question, we are going to keep only the `state` column.
-
-There are over 4 million rows in our dataset -- we do _not_ want to hit https://github.com/komoot/photon[photon's] API that many times. Instead, we are going to do the following:
-
-* Unless you feel comfortable using `data.table`, convert your `data.table` to a `data.frame`:
-
-[source,r]
-----
-my_dataframe <- data.frame(my_datatable)
-----
-
-* Calculate the average `LAT` and `LON` for each `STATECD`, and call the new `data.frame`, `dat`. This should result in 57 rows of lat/long pairs.
-
-* For each row in `dat`, run a reverse geocode and append the `state` to a new column called `STATE`.
-
-[TIP]
-====
-To calculate the average `LAT` and `LON` for each `STATECD`, you could use the https://www.rdocumentation.org/packages/sqldf/versions/0.4-11[`sqldf`] package to run SQL queries on your `data.frame`.
-====
-
-[TIP]
-====
-https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family[`mapply`] is a useful apply function to use to solve this problem.
-====
-
-[TIP]
-====
-Here is some extra help:
-
-[source,r]
-----
-library(revgeo)
-points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))
-# Note that the "output" argument gets passed to the "revgeo" function.
-mapply(revgeo, points$longitude, points$latitude, output="frame")
-# The output isn't in a great format, and we'd prefer to just get the "state" data.
-# Let's wrap "revgeo" into another function that just gets "state" and try again.
-get_state <- function(lon, lat) {
- return(revgeo(lon, lat, output="frame")["state"])
-}
-mapply(get_state, points$longitude, points$latitude)
-----
-====
-
-[IMPORTANT]
-====
-It is okay to get "Not Found" for some of the addresses.
-====
-
-.Items to submit
-====
-- Code used to solve the problem.
-- The `head` of the resulting `data.frame`.
-====
-
-=== Question 5
-
-Use the `leaflet`, `addTiles`, and `addCircles` functions from the `leaflet` package to plot our average latitude and longitude data from question (4) on a map (there should be a total of 57 lat/long pairs).
-
-[TIP]
-====
-See https://thedatamine.github.io/the-examples-book/r.html#r-ggmap[here] for an example of adding points to a map.
-====
-
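-A minimal sketch of the chained `leaflet` style, assuming `dat` from question (4) still has its `LAT` and `LON` columns:
-
-[source,r]
-----
-library(leaflet)
-
-# start a map, add the default tile layer, then add one circle per lat/long pair
-leaflet(dat) %>%
-    addTiles() %>%
-    addCircles(lng = ~LON, lat = ~LAT)
-----
-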
-.Items to submit
-====
-- Code used to create the map.
-- The map itself as output from running the code chunk.
-====
-
-=== Question 6
-
-Write a bash script that accepts at least 1 argument, and performs a useful task using at least 1 dataset from the `forest` folder in `/anvil/projects/tdm/data/forest/`. An example of a useful task could be printing a report of summary statistics for the data. Feel free to get creative. Note that tasks must be non-trivial -- a bash script that counts the number of lines in a file is _not_ appropriate. Make sure to properly document (via comments) what your bash script does. Also ensure that your script returns columnar data with appropriate separating characters (for example a csv).
-
-.Items to submit
-====
-- The content of your bash script starting from `#!/bin/bash`.
-- Example output from running your script as intended.
-- A description of what your script does.
-====
-
-=== Question 7
-
-You used `fread` in question (3). Now use the `cmd` argument in conjunction with your script from (6) to read the script output into a `data.table` in your R environment.
-
-.Items to submit
-====
-- The R code used to read in and preprocess your data using your bash script from (6).
-- The `head` of the resulting `data.table`.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project15.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project15.adoc
deleted file mode 100644
index ec17ad6e2..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project15.adoc
+++ /dev/null
@@ -1,85 +0,0 @@
-= STAT 39000: Project 15 -- Fall 2020
-
-**Motivation:** We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems.
-
-**Context:** In this project, we will reinforce topics you've already learned, with a focus on SQL.
-
-**Scope:** SQL, sqlite, R
-
-.Learning objectives
-****
-- Write and run SQL queries in `sqlite` on real-world data.
-- Use SQL from within R.
-****
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/movies_and_tv/imdb.db`
-
-F.R.I.E.N.D.S is a popular tv show. It has an interesting naming convention for its episodes: every episode title begins with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each character's name appears in the titles of the episodes.
-
-== Questions
-
-=== Question 1
-
-Write a query that gets the `episode_title_id`, `primary_title`, `rating`, and `votes` of all of the episodes of Friends (`title_id` is tt0108778).
-
-[TIP]
-====
-You can slightly modify the solution to question (5) in project 13.
-====
-
-.Items to submit
-====
-- SQL query used to answer the question.
-- First 5 results of the query.
-====
-
-=== Question 2
-
-Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to use regular expressions to search for text. For each character, how many episode `primary_title`s contain their name?
-
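-One possible shape for the R side, shown only as a sketch (the query string is a placeholder for your answer to question (1)):
-
-[source,r]
-----
-library(RSQLite)
-
-con <- dbConnect(SQLite(), dbname = "/class/datamine/data/movies_and_tv/imdb.db")
-episodes <- dbGetQuery(con, "YOUR QUERY FROM QUESTION (1) HERE")
-
-# count how many episode titles mention a given name (case-insensitive)
-sum(grepl("Ross", episodes$primary_title, ignore.case = TRUE))
-----
-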
-.Items to submit
-====
-- R code in a code chunk that was used to find the solution.
-- The solution pasted below the code chunk.
-====
-
-=== Question 3
-
-Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
-
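-For instance, with base R's `barplot` (the counts below are made-up placeholders, not the answer; the colors come from the list above):
-
-[source,r]
-----
-# hypothetical counts -- replace these with your actual results from question (2)
-counts <- c(Chandler = 10, Joey = 12, Monica = 9, Phoebe = 8, Rachel = 11, Ross = 13)
-barplot(counts,
-        main = "Episode titles mentioning each character",
-        xlab = "Character", ylab = "Number of episode titles",
-        col = c("#273c8b", "#bd253a", "#016f7c", "#f56934", "#016c5a", "#9055b1"))
-----
-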
-.Items to submit
-====
-- The R code used to generate the graphic.
-- The graphic in a png or jpg/jpeg format.
-====
-
-=== Question 4
-
-Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see the `type` column of the `titles` table): Romance, Comedy, Animation. In the `titles` table, you can find the genres in the `genres` column. There may be some overlap (i.e., a movie may have more than one genre); this is ok.
-
-For example, to query rows that have Action as one of their genres:
-
-[source,SQL]
-----
-SELECT * FROM titles WHERE genres LIKE '%action%';
-----
-
-.Items to submit
-====
-- Any code you used to solve the problem in a code chunk.
-- The average rating of each of the genres listed for movies.
-====
-
-=== Question 5
-
-Write a function called `top_episode` in R which accepts the path to the `imdb.db` database, as well as the `title_id` of a tv series (for example, "tt0108778" or "tt1266020"), and returns the `season_number`, `episode_number`, `primary_title`, and `rating` of the highest rated episode in the series. Test it out on some of your favorite series, and share the results.
-
-.Items to submit
-====
-- Any code you used to solve the problem in a code chunk.
-- The results for at least 3 of your favorite tv series.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project01.adoc
deleted file mode 100644
index 15978aaad..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project01.adoc
+++ /dev/null
@@ -1,265 +0,0 @@
-= STAT 19000: Project 1 -- Fall 2021
-
-== Welcome to The Data Mine!
-
-**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code.
-
-**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data!
-
-**Scope:** r, Jupyter Lab, Brown
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Brown.
-- Read and write basic (csv) data using R.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/flights/subset/1990.csv`
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-- `/depot/datamine/data/disney/splash_mountain.csv`
-
-== Questions
-
-=== Question 1
-
-For this course, projects will be solved using the Brown computing cluster: https://www.rcac.purdue.edu/compute/brown[Brown]. We may also use the Scholar computing cluster in the future (we have used Scholar in previous years).
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for both clusters, combined.
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer, provide a link to a common computer and provide the information for it instead.
-
-.Items to submit
-====
-- A sentence explaining how many cores and how much memory is available, in total, across all nodes on Brown.
-- A sentence explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-=== Question 2
-
-In previous semesters, we used a program called RStudio Server to run R code on Scholar and solve the projects. This year, instead, we will be using Jupyter Lab on the Brown cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate to and log in to https://ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen with lots of options. Don't worry, however; the next steps are very straightforward.
-
-++++
-
-++++
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions]. Click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says btn:[Jupyter Lab]. Click on btn:[Jupyter Lab].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 MB or 4096 MB of memory. We use the Brown cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive datasets with the entire Data Mine.
-
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2021-s2022::
-The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/book/projects/templates[our template page].
-====
-
-f2021-s2022-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2021-s2022-r kernel. Click on btn:[f2021-s2022-r], and a fresh notebook will be created for you.
-
-Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Brown that you are running on?
-
-[source,r]
-----
-system("hostname", intern=TRUE)
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node on Brown that you are running on).
-====
-
-=== Question 3
-
-In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2021-s2022-r`. If you click on this name you will have the option to swap kernels out. Change kernels to the `f2021-s2022` kernel, and practice by running the following code examples.
-
-python::
-[source,python]
-----
-my_list = [1, 2, 3]
-print(f'My list is: {my_list}')
-----
-
-SQL::
-[source, sql]
-----
-%load_ext sql
-----
-
-and then, in a separate cell:
-
-[source, sql]
-----
-%%sql
-sqlite:////depot/datamine/data/movies_and_tv/imdb.db
-SELECT * FROM titles LIMIT 5;
-----
-
-
-bash::
-[source,bash]
-----
-%%bash
-awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /depot/datamine/data/flights/subset/1990.csv
-----
-
-[TIP]
-====
-To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/book/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/book/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Brown at `/depot/datamine/apps/templates/project_template.ipynb`).
-
-++++
-
-++++
-
-++++
-
-++++
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many of each type of cell are there in the default template?
-====
-
-=== Question 5
-
-In question (1) we answered questions about cores and memory for the Brown cluster. To do so, we needed to perform some arithmetic. Instead of using a calculator (or paper, or mental math for you good mental math folks), write these calculations using R _and_ Python, in separate code cells.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-++++
-
-++++
-
-In the previous question, we ran our first R and Python code. In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code.
-
-[source,r]
-----
-dat <- read.csv("/depot/datamine/data/disney/splash_mountain.csv")
-----
-
-Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. By default, the `head` function returns the first 6 rows of the dataset.
-
-[source,r]
-----
-head(dat)
-----
-
-`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`.
-
-Run our code to read in our dataset again; this time, instead of naming our resulting dataset `dat`, name it `splash_mountain`. Place all of your code into a new cell. Be sure to include a level 2 header titled "Question 6" above your code cell.
-
-[TIP]
-====
-In markdown, a level 2 header is any line starting with 2 `\#`'s. For example, `\#\# Question X` is a level 2 header. When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here].
-====
-
-[TIP]
-====
-If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don't want to do that, feel free to use the `%%markdown` magic to create a markdown cell without _really_ creating a markdown cell. Any cell that starts with `%%markdown` in the first line will generate markdown when run.
-====
-
-[NOTE]
-====
-We didn't need to re-read our data in this question just to get a dataset named `splash_mountain`. We could have renamed `dat` to `splash_mountain` like this.
-
-[source,r]
-----
-splash_mountain <- dat
-----
-
-Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right; this is certainly _not_ what we want! What if it were a 5 GB dataset? That would be a lot of wasted space! Well, R uses copy-on-modify semantics. What this means is that until you modify either `dat` or `splash_mountain`, the dataset isn't actually copied. You can therefore run the following code to remove the other reference to our dataset.
-
-[source,r]
-----
-rm(dat)
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-Let's pretend we are now done with the project. We've written some code, maybe added some markdown cells to explain what we did, and we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project.
-
-We will always require a PDF which contains text, code, and code output. This is our "source of truth" and what the graders will turn to first when grading.
-
-[WARNING]
-====
-You _must_ double check your PDF before submitting it. A _very_ common mistake is to assume that your PDF has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work.
-====
-
-A PDF is generated by first running every cell in the notebook, and then exporting to a PDF.
-
-In addition to the PDF, if a project uses R code, you will need to also submit R code in an R script. An R script is just a text file with the extension `.R`. When submitting Python code, you will need to also submit a Python script. A Python script is just a text file with the extension `.py`.
-
-Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Next, take the Python code from this project and copy and paste it into a text file with the `.py` extension. Call it `firstname-lastname-project01.py`. Compile your PDF -- making sure that the output from all of your code is present and in the PDF.
-
-Once complete, submit your PDF, R script, and Python script.
-
-.Items to submit
-====
-- Resulting PDF (`firstname-lastname-project01.pdf`).
-- `firstname-lastname-project01.R`.
-- `firstname-lastname-project01.py`.
-- `firstname-lastname-project01.ipynb`.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project02.adoc
deleted file mode 100644
index cd0653150..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project02.adoc
+++ /dev/null
@@ -1,168 +0,0 @@
-= STAT 19000: Project 2 -- Fall 2021
-
-== Introduction to R using https://whin.org[WHIN] weather data
-
-**Motivation:** The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn some of the basics while solving data-driven problems.
-
-[NOTE]
-====
-R and Python both have their advantages and disadvantages. There still exist domains and problems where R is better than Python, and where Python is better than R. In addition, https://julialang.org/[Julia] is another language in this domain that is quickly gaining popularity for its speed and Python-like ease of use.
-====
-
-**Context:** In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often a more effective tool than a spreadsheet.
-
-**Scope:** r, vectors, indexing, recycling
-
-.Learning Objectives
-****
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Read and write basic (csv) data using R.
-- Identify good and bad aspects of simple plots.
-- Explain what "recycling" is in R and predict behavior of provided statements.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/whin/stations.csv`
-- `/depot/datamine/data/whin/weather.csv`
-
-[NOTE]
-====
-These datasets are generously provided to us by one of our corporate partners, the Wabash Heartland Innovation Network (WHIN). You can learn more about WHIN on their website at https://whin.org/[WHIN]. You can learn more about their API https://data.whin.org[here]. This won't be the last time we work with WHIN data; in the future you will get the opportunity to use their API to solve problems that you might not have thought of.
-====
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-While you may not always (and perhaps only rarely) be provided with a neat and clean dataset to work on, when you are, getting a good feel for the dataset(s) is a good first step to solving any data-driven problem.
-
-Use the `read.csv` function to load our datasets into `data.frame`s named `stations` and `weather`.
-
-[NOTE]
-====
-`read.csv` loads data into a `data.frame` object _by default_. We will learn more about the idea of a `data.frame` in the future. For now, just think of it like a spreadsheet, in which each column contains the same type of data (e.g., numeric data, strings, etc.).
-====
-
-Use functions like `head`, `tail`, `str`, and `summary` to explore the data. What are the dimensions of each dataset? What are the first 5 rows of `stations`? What are the first 5 rows of `weather`? What are the names of the columns in each dataset?
-
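-A sketch of what that first pass might look like (all of the exploration functions shown come from base R):
-
-[source,r]
-----
-stations <- read.csv("/depot/datamine/data/whin/stations.csv")
-weather <- read.csv("/depot/datamine/data/whin/weather.csv")
-
-dim(stations)          # number of rows and columns
-head(stations, n = 5)  # first 5 rows
-str(weather)           # column names and types
-summary(weather)       # quick numeric summaries, column by column
-----
-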
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Text answering all of the questions above.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-The following R code extracts the `temperature` column from our `weather` `data.frame`, into a vector named `temp`.
-
-[source,r]
-----
-temp <- weather$temperature
-----
-
-What is the first value in the vector? How about the 100th? What is the last? What type of data is in the vector?
-
-[TIP]
-====
-Use the `typeof` function to find out the type of data in a vector.
-====
-
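-If positional indexing is new to you, here is a tiny self-contained example on a toy vector (not the weather data):
-
-[source,r]
-----
-v <- c(10, 20, 30, 40, 50)
-v[1]          # first value: 10
-v[3]          # third value: 30
-v[length(v)]  # last value: 50
-typeof(v)     # "double"
-----
-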
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-You should now know at least 1 method for extracting data from a `data.frame` (using the `$`), and should now understand a little bit about indexing. That's great! Use indexing to add the first 100 `rain_inches_last_hour` values from the `weather` `data.frame` to the last 100 `rain_inches_last_hour` values, and save the result in a new vector named `temp100`. Do this in 1 line of code.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-++++
-
-++++
-
-In question (3) we were able to rapidly add values together from two subsets of the same vector. This worked out very nicely because both subsets of data contained 100 values. The first value from the first subset of data was added to the first value from the second subset of data, and so on.
-
-For the station with `station_id` 20, get a vector containing all `temperature` values >= 85. Call this vector `hot_temps`. Get a vector containing all `temperature` values \<= 40, and call this vector `cold_temps`. How many elements are in `hot_temps`? How many elements are in `cold_temps`? Attempt to add the vectors together. What happens? Read https://excelkingdom.blogspot.com/2018/01/what-recycling-of-vector-elements-in-r.html[this] to understand what is happening.
-
-[NOTE]
-====
-This is called _recycling_. Recycling is a very powerful feature of R. It allows you to reuse the same vector elements in different contexts. It can also be a very misleading and dangerous feature as it can lead to unexpected results. This is why it is important to pay attention when R gives you a warning -- something that you aren't expecting may be happening behind the scenes.
-====
-
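-To see recycling in isolation, here is a toy example you can run before working with the weather data:
-
-[source,r]
-----
-long <- c(1, 2, 3, 4, 5, 6)
-short <- c(10, 20)
-
-# short is recycled to (10, 20, 10, 20, 10, 20) before the addition
-long + short            # 11 22 13 24 15 26
-
-# when the longer length is not a multiple of the shorter length,
-# R still recycles, but it emits a warning
-c(1, 2, 3) + c(10, 20)  # 11 22 13, plus a warning
-----
-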
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Pick any station you are interested in, and create one or more dotplots showing data from any of the columns you are interested in. For each plot, write 1-2 sentences describing any patterns you see. If you don't see any patterns, that is okay; just write, "I don't see any patterns."
-
-[TIP]
-====
-This is a good opportunity to look at the data in the dataset and explore the variables and see what types of patterns the various variables have. Please feel free to spruce up your plots if you so desire -- it is completely optional, and we will have plenty of time to work on plots as the semester progresses.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences describing any patterns you see.
-====
-
-=== Question 6
-
-The following three graphics were each created with a few lines of R code. The first two graphics were created using only core R functions. The third graphic was created using the `ggplot2` package. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any).
-
-image::figure04.webp[Plot 1, width=400, height=400, loading=lazy, title="Plot 1"]
-
-image::figure05.webp[Plot 2, width=400, height=400, loading=lazy, title="Plot 2"]
-
-image::figure06.webp[Plot 3, width=400, height=400, loading=lazy, title="Plot 3"]
-
-.Items to submit
-====
-- 1-2 sentences explaining which is your favorite graphic, why, what could be improved, and any interesting observations you may have (if any).
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project03.adoc
deleted file mode 100644
index ac15d094c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project03.adoc
+++ /dev/null
@@ -1,175 +0,0 @@
-= STAT 19000: Project 3 -- Fall 2021
-
-**Motivation:** `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.
-
-**Context:** In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called `data.frame`s.
-
-**Scope:** r, data.frames, recycling, factors
-
-.Learning Objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/olympics/*.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-Use R code to print the names of the two datasets in the `/depot/datamine/data/olympics` directory.
-
-Read the larger dataset into a data.frame called `olympics`.
-
-Print the first 6 rows of the `olympics` data.frame, and take a look at the columns. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds.
-
-**Relevant topics:** list.files, file.info, read.csv, head
-
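-A sketch of the two directory-related functions listed above:
-
-[source,r]
-----
-# list the files in the directory; full.names = TRUE returns complete paths
-files <- list.files("/depot/datamine/data/olympics", full.names = TRUE)
-files
-
-# file.info returns one row per file; the size column is in bytes
-file.info(files)$size
-----
-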
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining the dataset.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-++++
-
-++++
-
-How many unique sports are accounted for in our `olympics` dataset? Print a list of the sports. Is there any sport that you weren't expecting? Why or why not?
-
-[IMPORTANT]
-====
-R is a case-sensitive language. What this means is that whether or not 1 or more letters in a word are capitalized is important. For example, the following two variables are different.
-
-[source,r]
-----
-vec <- c(1,2,3)
-Vec <- c(3,2,1) # note the capital "V" in our variable name
-
-print(vec) # will print: 1,2,3
-print(Vec) # will print: 3,2,1
-----
-
-So, when you are examining a `data.frame` and you see a column name that starts with a capital letter, it is critical that you use the same capitalization when trying to access said column.
-
-[source,r]
-----
-colnames(iris)
-----
-
-----
-[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
-----
-
-[source,r]
-----
-iris$"sepal.Length" # will NOT work
-iris$"Sepal.length" # will NOT work
-iris$"Sepal.Length" # will work
-----
-====
-
-**Relevant topics:** unique, length
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining the results.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Create a data.frame called `us_athletes` that contains only information on athletes from the USA. Use the column `NOC` (National Olympic Committee 3-letter code). How many rows does `us_athletes` have?
-
-Now, perform the same operation on the `olympics` data.frame, this time containing only the information on the athletes from the country of your choice. Name this new data.frame appropriately. How many rows does it have?
-
-Now, create a data.frame called `both` that contains the information on the athletes from the USA and the country of your choice. How many rows does it have?
-
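-The pattern you need here can be practiced first on the built-in `iris` data.frame (which appears again in the note below); a toy sketch of logical indexing:
-
-[source,r]
-----
-# keep only the rows where Species is "setosa"
-setosa <- iris[iris$Species == "setosa", ]
-nrow(setosa)
-
-# the same idea with two allowed values, using %in%
-two_species <- iris[iris$Species %in% c("setosa", "virginica"), ]
-nrow(two_species)
-----
-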
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- How many rows (athletes) are in the `us_athletes` dataset?
-- How many rows (athletes) are in the other country's dataset?
-- How many rows (athletes) are in the `both` dataset?
-====
-
-=== Question 4
-
-++++
-
-++++
-
-++++
-
-++++
-
-What percentage of US athletes are women? What percentage of US athletes with gold medals are women?
-
-Answer the same questions for your "other" country from question (3).
-
-**Relevant topics:** prop.table, table, indexing
-
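-If `prop.table` is new to you, here is a toy example of the pattern (again using the built-in `iris` data rather than the olympics data):
-
-[source,r]
-----
-counts <- table(iris$Species)
-counts
-
-prop.table(counts)        # proportions that sum to 1
-prop.table(counts) * 100  # the same values as percentages
-----
-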
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Who is the oldest US athlete to compete, based on our `us_athletes` data.frame? At what age, in which sport, and in what year did the athlete compete?
-
-Answer the same questions for your "other" country from question (3) and question (4).
-
-[IMPORTANT]
-====
-Make sure you use indexing to print _only_ the athlete's information (age, sport, year).
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Age, sport, and olympics year that the oldest athlete competed in, for each of your countries.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project04.adoc
deleted file mode 100644
index 27ff1acd5..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project04.adoc
+++ /dev/null
@@ -1,187 +0,0 @@
-= STAT 19000: Project 4 -- Fall 2021
-
-**Motivation:** Control flow consists of the tools and methods that you can use to control the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.
-
-**Context:** We are gaining familiarity working in Jupyter Lab and writing R code. In this project we introduce and practice using control flow in R, while continuing to reinforce concepts from the previous projects.
-
-**Scope:** r, data.frames, recycling, factors, if/else, for loops
-
-.Learning Objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/olympics/*.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Winning an olympic medal is an amazing achievement for athletes.
-
-Before we take a deep dive into our data, it is always important to do some sanity checks, and understand what population our sample is representative of, particularly if we didn't get a chance to participate in the data collection phase of the project.
-
-Let's do a quick check on our dataset. We would expect that most athletes would not have won a medal. What percentage of athletes did not get a medal in the `olympics` data.frame?
-
-For simplicity, consider an "athlete" to be a row of the data.frame. Do not worry about the same athlete participating in different Olympic games, or in different sports.
-
-We are considering the combination of `Sport`, `Event`, and `Games` as a unique identifier for an athlete.
-
-**Relevant topics:** is.na, mean, indexing, sum
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-A friend of yours hypothesized that there is some association between getting a medal and the athlete's age.
-
-You want to test it out using our `olympics` data.frame. To do so we will compare 2 new variables:
-
-. (This question) An indicator of whether the athlete in that year and sport won a medal or not.
-. (Next question) Age converted into age categories.
-
-Create a new variable in your `olympics` data.frame called `won_medal` which indicates whether the athlete in that year and sport won a medal or not.
-
-**Relevant topics:** is.na
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Now that we have our first new column, `won_medal`, let's categorize the `Age` column. Using a for loop and if/else statements, create a new column in the `olympics` data.frame called `age_cat`. Use the guidelines below to do so.
-
-- "youth": less than 18 years old
-- "young adult": between 18 and 25 years old
-- "adult": 26 to 35 years old
-- "middle age adult": between 36 to 55 years old
-- "wise adult": greater than 55 years old
-
-How many athletes are "young adults"?
-
-[TIP]
-====
-Remember to consider the `NA`s as you are solving the problem.
-====
-
-**Relevant topics:** nrow, if/else, for loops, indexing, is.na
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- How many athletes are "young adults"?
-====
-
-=== Question 4
-
-++++
-
-++++
-
-We did _not_ need to use a for loop to solve the previous problem. Another way to solve the problem would have been to use a vectorized function called `cut`.
-
-Create a variable called `age_cat_cut` using the `cut` function to solve the problem above.
-
-[TIP]
-====
-To check that you are getting the same results, run the following commands.
-
-If you used `cut`'s `labels` argument in your code:
-
-[source,r]
-----
-all.equal(as.character(age_cat_cut), olympics$age_cat)
-----
-
-If you didn't use `cut`'s `labels` argument in your code:
-
-[source,r]
-----
-levels(age_cat_cut) <- c('youth', 'young adult', 'adult', 'middle age adult', 'wise adult')
-all.equal(as.character(age_cat_cut), olympics$age_cat)
-----
-====
-
-[TIP]
-====
-Note that by default `cut` treats the breaks as right-closed intervals. For example, if the breaks are `c(a, b, c)`, the intervals will be `(a, b]` and `(b, c]`.
-====
-
-[TIP]
-====
-You can use the argument `labels` in `cut` to label the categories similarly to what we did in question (3).
-====
-
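-To make the two tips above concrete, here is a toy example of `cut` on a handful of made-up ages (the breaks shown assume whole-number ages and are one possible choice, not the required one):
-
-[source,r]
-----
-ages <- c(15, 18, 25, 26, 35, 36, 55, 60)
-
-# right-closed intervals: (-Inf,17], (17,25], (25,35], (35,55], (55,Inf]
-cut(ages,
-    breaks = c(-Inf, 17, 25, 35, 55, Inf),
-    labels = c("youth", "young adult", "adult", "middle age adult", "wise adult"))
-----
-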
-[NOTE]
-====
-These past 2 questions do a good job emphasizing the importance of vectorized functions. How long did it take you to run the solution to question (3) vs question (4)? If you find yourself looping through one or more columns one at a time, there is likely a better option.
-====
-
-**Relevant topics:** cut
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Now that we have the new columns in the `olympics` data.frame, look at the data and write down your conclusions. Is there some association between winning a medal and the athlete's age?
-
-There are a couple of ways you can look at the data to make your conclusions. You can visualize it using plots, with functions like `barplot` and `pie`. Alternatively, you can use numeric summaries, like a table or a table of proportions (`prop.table`). Regardless of the method used, explain your findings, and feel free to get creative!
-
-[NOTE]
-====
-You do not need to use any special statistical test to make your conclusions. The goal of this question is to explore the data and think logically.
-====
-
-[TIP]
-====
-The argument `margin` may be useful if you use the `prop.table` function.
-====
-
-**Relevant topics:** barplot, pie, indexing, table, prop.table, balloonplot
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project05.adoc
deleted file mode 100644
index d38c80587..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project05.adoc
+++ /dev/null
@@ -1,248 +0,0 @@
-= STAT 19000: Project 5 -- Fall 2021
-
-**Motivation:** As briefly mentioned in project 4, R differs from other programming languages in that _typically_ you will want to avoid using for loops, and instead use vectorized functions and the "apply" suite. In this project we will use vectorized functions to solve a variety of data-driven problems.
-
-**Context:** While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.
-
-**Scope:** r, data.frames, recycling, factors, if/else, for loops, apply suite
-
-.Learning Objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: for loops.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/youtube/*.{csv,json}`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Read the dataset `USvideos.csv` into a data.frame called `us_youtube`. The dataset contains YouTube trending videos between 2017 and 2018.
-
-[NOTE]
-====
-The dataset has two columns that refer to time: `trending_date` and `publish_time`.
-
-The column `trending_date` is organized in a `[year].[day].[month]` format, while the `publish_time` is in a different format.
-====
-
-When working with dates, it is important to use tools made specifically for this purpose (rather than using string manipulation, for example). We've provided you with the code below. The provided code uses the `lubridate` package, an excellent package which hides away many common issues that occur when working with dates. Feel free to check out https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf[the official cheatsheet] in case you'd like to learn more about the package.
-
-Run the code below to create two new columns: `trending_year` and `publish_year`.
-
-[source,r]
-----
-library(lubridate)
-
-# convert columns to date formats
-us_youtube$trending_date <- ydm(us_youtube$trending_date)
-us_youtube$publish_time <- ymd_hms(us_youtube$publish_time)
-
-# extract the trending_year and publish_year
-us_youtube$trending_year <- year(us_youtube$trending_date)
-us_youtube$publish_year <- year(us_youtube$publish_time)
-
-unique(us_youtube$trending_year)
-unique(us_youtube$publish_year)
-----
-
-Take a look at our newly created columns. What type are the new columns? In the provided code, which (if any) of the 4 functions are vectorized?
-
-Now, duplicate the functionality of the provided code using only the following functions: `as.numeric`, `substr`, and regular vectorized operations like `+`, `-`, `*` and `/`. Which was easier?
-
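-If it helps, here is the general `substr` and `as.numeric` pattern on a made-up timestamp string (check the real formats of `trending_date` and `publish_time` with `head` before deciding which characters to extract):
-
-[source,r]
-----
-s <- "2017-11-14 07:30:00"   # a made-up example string, not the dataset's exact format
-substr(s, 1, 4)              # "2017"
-as.numeric(substr(s, 1, 4))  # 2017
-----
-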
-**Relevant topics:** read.csv, typeof
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-While some great content certainly comes out of the United States, there is a lot of great content from other countries too. Plus, the data is a reasonable size to combine into a single data.frame.
-
-Look in the following directory: `/depot/datamine/data/youtube`. You will find files that look like this:
-
-----
-CAvideos.csv
-DEvideos.csv
-USvideos.csv
-...
-----
-
-You will notice how each dataset follows the same naming convention. Each file starts with the country code, `US`, `DE`, `CA`, etc., followed immediately by "videos.csv".
-
-Use a loop and the given vector to systematically combine the data into a new data.frame called `yt`.
-
-[source,r]
-----
-countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')
-----
-
-Loop through each of the values in `countries`. Use the `paste0` function to create a string that is the absolute path to each of the files. So, for example, the following would represent the steps to perform in the first iteration of the loop.
-
-- In our first iteration we have the value `CA`.
-- We would use `paste0` to create a string containing the absolute path of the corresponding dataset: `/depot/datamine/data/youtube/CAvideos.csv`.
-- Then, we would use that string as an argument to the `read.csv` function to read the data into a data.frame.
-- Then, we would add the new column `country_code` to the data.frame with the value `CA` repeated for each row.
-- Finally, we would use the `rbind` function to combine the new data.frame with the previous data.frame.
-
-In the end, you will have a single data.frame called `yt` that contains the data for _every_ country in the dataset. `yt` will _also_ have a column called `country_code` that contains the country code for each row, so we know where the data originated.
-
-[IMPORTANT]
-====
-When combining data, it is important that we don't lose any data in the process. If we slapped together all of the data from each of the datasets into a single file named `yt.csv`, what data would we lose?
-====
-
-In order to prevent this loss of data, create a new column called `country_code` that includes this information in the dataset rather than in the filename.
-
-Print a list of the columns in `yt`; in addition, print the dimensions of `yt`. Finally, create the `trending_year` and `publish_year` columns for `yt`.
-
-[source,r]
-----
-# Dr Ward summarizes how to perform Question 2 in the video.
-# Here is the analogous code for this question.
-# We know that all of this is new for you.
-# That is why we are guiding you through this question!
-
-getdataframe <- function(mycountry) {
- myDF <- read.csv(paste0("/depot/datamine/data/youtube/", mycountry, "videos.csv"))
- myDF$country_code <- mycountry
- return(myDF)
-}
-
-countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')
-
-myresults <- lapply(countries, getdataframe)
-
-yt <- do.call(rbind, myresults)
-
-----
-
-**Relevant topics:** read.csv, paste0, rbind, dim, colnames
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-From this point on, unless specified, use the `yt` data.frame to answer the questions.
-====
-
-Which YouTube video took the longest time to trend from the time it was published? How many years did it take to trend?
-
-**Relevant topics:** which.max, indexing
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Name of the YouTube video, and how long it took to trend.
-- (Optional) Did you watch the video prior to the project? If so, what do you think about it?
-====
-
-=== Question 4
-
-++++
-
-++++
-
-We are interested in seeing whether or not there is a difference in views between videos with ratings enabled vs. those with ratings disabled.
-
-Calculate the average number of views for videos with ratings enabled and those with ratings disabled. Anecdotally, does it look like disabling the ratings helps or hurts the views?
-
-[TIP]
-====
-You can use `tapply` to solve this problem if you are comfortable with the `tapply` function. Otherwise, stay tuned for a future project where we will explore the `tapply` function in more detail.
-====
-
-[TIP]
-====
-You _may_ need to take a careful look at the `ratings_disabled` column. What type should this column be? Make sure to convert if necessary.
-====
-
-**Relevant topics:** mean, tapply, indexing
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Create two new columns in `yt`:
-
-- `balance`: the difference between `likes` and `dislikes` for a given video.
-- `positive_balance`: an indicator variable that is `TRUE` if `balance` is greater than zero, and `FALSE` otherwise.
-
-How many videos have a positive balance?
-
-**Relevant topics:** sum
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-Compare videos where `positive_balance` is `TRUE` to those where `positive_balance` is `FALSE`. Make this comparison based on the `comment_count` and the `views` of the videos.
-
-To make a comparison, pick a statistic to summarize and compare `comment_count` and `views`. Examples of statistics include: `mean`, `median`, `max`, `min`, `var`, and `sd`.
-
-You can pick more than one statistic to compare, if you want, and each column may have its own statistic(s) to summarize it.
-
-**Relevant topics:** tapply, mean, sum, var, sd, max, min, median
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining what statistic you chose to summarize each column, and why.
-- 1-2 sentences comparing videos with positive balance and non-positive balance based on `comment_count` and `views`. Is the result surprising to you?
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project06.adoc
deleted file mode 100644
index 762b1ae7a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project06.adoc
+++ /dev/null
@@ -1,164 +0,0 @@
-= STAT 19000: Project 6 -- Fall 2021
-
-**Motivation:** `tapply` is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provides a fast way of performing operations that would normally require the use of loops. If you have any familiarity with SQL, `tapply` is very similar to working with the `GROUP BY` clause -- you first group your data using some rule, and then perform some operation for each newly created group.
-
-**Context:** The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called `tapply` from the "apply suite" of functions in R.
-
-**Scope:** r, tapply
-
-.Learning Objectives
-****
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
-- Demonstrate using tapply to perform calculations on subsets of data.
-****
-
-Make sure to read about, and use, the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/amazon/tracks.csv`
-- `/depot/datamine/data/amazon/tracks.db`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Load the `tracks.csv` file into an R data.frame called `tracks`. Immediately after loading the file, run the following.
-
-[source,r]
-----
-str(tracks)
-----
-
-What happens?
-
-[TIP]
-====
-The C in CSV is not true for this dataset! You'll need to take advantage of the `sep` argument of `read.csv` to read in this dataset.
-====
-
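-The call ends up looking something like the following; the separator shown is only a placeholder, so check the raw file first to see which character actually separates the fields:
-
-[source,r]
-----
-# sep = ";" is a guess used purely for illustration -- replace it with the real delimiter
-tracks <- read.csv("/depot/datamine/data/amazon/tracks.csv", sep = ";")
-----
-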
-Once you've successfully read in the data, re-run the following.
-
-[source,r]
-----
-str(tracks)
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Great! `tapply` is a very cool, very powerful function in R.
-
-First, let's say that we wanted to see what the average `duration` (a column in the `tracks` data.frame) of songs was _by_ each `year` (a column in the `tracks` data.frame). If you think about how you would approach solving this problem, there are a lot of components to keep track of!
-
-- You don't know ahead of time how many different years are in the dataset.
-- You have to associate each average `duration` with a specific `year`.
-- Etc.
-
-It's a lot of work!
-
-In R, there is a really great library that allows us to run queries on an sqlite database and put the result directly into a data.frame. This would be the SQL and R solution to this problem.
-
-[source,r]
-----
-library(RSQLite)
-
-con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db")
-myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;")
-head(myDF)
-----
-
-Use `tapply` to solve the same problem! Are your results the same? Print the first 5 results to make sure they are the same.
-
-[TIP]
-====
-`tapply` can take a minute to get the hang of. I like to think about the first argument to `tapply` as the column of data we want to _perform an operation_ on, the second argument to `tapply` as the column of data we want to _group_ by, and the third argument as the operation (as a function, like `sum`, or `median`, or `mean` or `sd`, or `var`, etc.) we want to perform on the data.
-====
-
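-To make the three-argument pattern in the tip concrete, here it is on a small built-in dataset rather than `tracks`:
-
-[source,r]
-----
-# argument 1: the column to operate on
-# argument 2: the column to group by
-# argument 3: the function applied to each group
-tapply(iris$Sepal.Length, iris$Species, mean)
-----
-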
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Plot the results of question (2) with any appropriate plot that will highlight the duration of music by year, sequentially. What patterns do you see, if any?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Ha! That's not so bad! What are the `artist_name`s of the artists with the highest median `duration` of songs? Sort the results of the `tapply` function in descending order and print the first 5 results.
-
-[CAUTION]
-====
-This may take a few minutes to run -- this function is doing a lot and there are a lot of artists in this dataset!
-====
-
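-The sorting step can be sketched on a stand-in named vector (`res` below is just a placeholder for your actual `tapply` result):
-
-[source,r]
-----
-res <- c(a = 3.2, b = 7.5, c = 1.1, d = 9.9, e = 4.4)
-head(sort(res, decreasing = TRUE), 5)
-----
-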
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Explore the dataset and come up with a question you want to answer. Make sure `tapply` would be useful for your investigation, and use `tapply` to calculate something interesting for the dataset. Create one or more graphics as you are working on your question. Write 1-2 sentences reviewing your findings. It could be anything, and your findings do not need to be "good" or "bad"; they can be boring (much like a lot of research findings)!
-
-.Items to submit
-====
-- Question you want to answer.
-- Code used to solve this problem.
-- Output (including graphic(s)) from running the code.
-- 1-2 sentences reviewing your findings.
-====
-
-=== Question 6 (optional, 0 pts)
-
-Take a crack at solving a problem (any problem) you want to solve with R and SQL. You can use the following code to help. Create a cool graphic with the results!
-
-[source,r]
-----
-library(RSQLite)
-
-con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db")
-myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;")
-myDF
-----
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project07.adoc
deleted file mode 100644
index ab102521f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project07.adoc
+++ /dev/null
@@ -1,219 +0,0 @@
-= STAT 19000: Project 7 -- Fall 2021
-
-**Motivation:** A couple of bread-and-butter functions that are part of base R are `subset` and `merge`. `subset` provides a more natural way to filter and select data from a data.frame. `merge` brings the principles of combining data that SQL uses to R.
-
-**Context:** We've been getting comfortable working with data within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!
-
-**Scope:** r, subset, merge, tapply
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/goodreads/csv/*.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-Read the `goodreads_books.csv` into a data.frame called `books`. Let's say Dr. Ward is working on a new book. He is looking for advice and wants some insight from us.
-
-A friend told him that he should pick a month in the Summer to publish his book.
-
-Based on our `books` dataset, is there any evidence that certain months get higher than average ratings? What month would you suggest for Dr. Ward to publish his new book?
-
-[TIP]
-====
-Use columns `average_rating` and `publication_month` to solve this question.
-====
-
-[TIP]
-====
-To read the data in faster and more efficiently, try the following:
-
-[source,r]
-----
-library(data.table)
-books <- fread("/path/to/data")
-----
-====
-
-**Relevant topics:** tapply, mean
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences comparing the publication month based on average rating.
-- 1-2 sentences with your suggestion to Dr. Ward and reasoning.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Create a new column called `book_size_cat` that is a categorical variable based on the number of pages a book has.
-
-`book_size_cat` should have 3 levels: `small`, `medium`, `large`.
-
-Run the code below to get different summaries and visualizations of the number of pages books have in our datasets.
-
-[source,r]
-----
-summary(books$num_pages)
-hist(books$num_pages)
-hist(books$num_pages[books$num_pages <= 1000])
-boxplot(books$num_pages[books$num_pages < 4000])
-----
-
-Pick the values at which to separate these levels. Write 1-2 sentences explaining why you picked those values.
-
-[TIP]
-====
-You can make other visualizations to help determine your cutoffs. Have fun -- there is no right or wrong answer. What would you consider a small, medium, and large book?
-====
-
-**Relevant topics:** cut
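-
-A minimal sketch of the `cut` pattern is below -- the cutoff values shown are purely hypothetical, so substitute your own based on your exploration.
-
-[source,r]
-----
-# hypothetical cutoffs: <= 150 pages is "small", 151-400 is "medium", > 400 is "large"
-books$book_size_cat <- cut(books$num_pages,
-                           breaks = c(0, 150, 400, Inf),
-                           labels = c("small", "medium", "large"))
-table(books$book_size_cat)
-----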
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining the values you picked to create your categorical data and why.
-- The results of running `table(books$book_size_cat)`.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Dr. Ward is a firm believer in constructive feedback, and would like people to provide feedback for his book.
-
-What recommendation would you make to Dr. Ward when it comes to book size?
-
-[TIP]
-====
-Use the column `text_reviews_count` and compare, on average, how many text reviews the various book sizes get.
-====
-
-[NOTE]
-====
-Association is not causation, and there are many factors that lead to people providing reviews. Your recommendation can be based on anecdotal evidence, no worries.
-====
-
-**Relevant topics:** tapply, mean
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences with your recommendation and reasoning.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Sometimes (oftentimes) looking at a single summary of our data may not provide the full picture.
-
-Make a side-by-side boxplot for the `text_reviews_count` by `book_size_cat`.
-
-Does your answer to question (3) change based on your plot?
-
-[TIP]
-====
-Take a look at the first example when you run `?boxplot`.
-====
-
-[TIP]
-====
-You can make three boxplots if you prefer, but make sure that they all have the same y-axis limits to make the comparisons fair.
-====
-
-**Relevant topics:** boxplot
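-
-One possible starting point (a sketch, assuming the `book_size_cat` column from question (2) exists in `books`) is the formula interface to `boxplot`:
-
-[source,r]
-----
-# one box per level of book_size_cat, all sharing the same y-axis
-boxplot(text_reviews_count ~ book_size_cat, data = books,
-        xlab = "Book size", ylab = "Number of text reviews")
-----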
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences with your recommendation and reasoning.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Repeat question (4), this time, use the `subset` function to reduce your data to books with a `text_reviews_count` less than 200. How does this change your plot? Is it a little easier to read?
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences with your recommendation and reasoning.
-====
-
-=== Question 6
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Read the `goodreads_book_authors.csv` into a new data.frame called `authors`.
-
-Use the `merge` function to combine the `books` data.frame with the `authors` data.frame. Call your new data.frame `books_authors`.
-
-Now, use the `subset` function to get a subset of your data for your favorite authors. Include at least 5 authors that appear in the dataset.
-
-Redo question (4) using this new subset of data. Does your recommendation change at all?
-
-[TIP]
-====
-Make sure you pay close attention to the resulting `books_authors` data.frame. The column names will be changed to reflect the merge. Instead of `text_reviews_count` you may need to use `text_reviews_count.x`, or `text_reviews_count.y`, depending on how you merged.
-====
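-
-To see where the `.x`/`.y` suffixes come from, here is a tiny sketch with made-up data frames (not the goodreads data):
-
-[source,r]
-----
-a <- data.frame(id = 1:2, n = c(10, 20))
-b <- data.frame(id = 1:2, n = c(1, 2))
-merge(a, b, by = "id")   # the shared column name becomes n.x and n.y
-----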
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences with your recommendation and reasoning.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project08.adoc
deleted file mode 100644
index 9a1a7a62a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project08.adoc
+++ /dev/null
@@ -1,222 +0,0 @@
-= STAT 19000: Project 8 -- Fall 2021
-
-**Motivation:** A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!
-
-**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
-
-**Scope:** r, functions
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-- Comprehend what a function is, and the components of a function in R.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/goodreads/csv/interactions_subset.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-Read the `interactions_subset.csv` into a data.frame called `interactions`. We have provided you with the function `get_probability_of_review` below.
-
-After reading in the data, run the code below, and add comments explaining what the function is doing at each step.
-
-[source,r]
-----
-# A function that, given a user ID (userID) and a minimum rating (min_rating), returns a value (probability_of_reviewing).
-get_probability_of_review <- function(interactions_dataset, userID, min_rating) {
- # FILL IN EXPLANATION HERE
- user_data <- subset(interactions_dataset, user_id == userID)
-
- # FILL IN EXPLANATION HERE
- read_user_data <- subset(user_data, is_read == 1)
-
- # FILL IN EXPLANATION HERE
- read_user_min_rating_data <- subset(read_user_data, rating >= min_rating)
-
- # FILL IN EXPLANATION HERE
- probability_of_reviewing <- mean(read_user_min_rating_data$is_reviewed)
-
- # Return the result
- return(probability_of_reviewing)
-}
-
-get_probability_of_review(interactions_dataset = interactions, userID = 5000, min_rating = 3)
-----
-
-Provide 1-2 sentences explaining overall what the function is doing and what arguments it requires.
-
-[TIP]
-====
-You may want to use `fread` function from the library `data.table` to read in the data.
-====
-
-[source,r]
-----
-library(data.table)
-interactions <- fread("/path/to/dataset")
-----
-
-[CAUTION]
-====
-Your kernel may crash! As it turns out, the `subset` function is not very memory efficient (never fully trust a function). When you launch your Jupyter Lab session, if you use 3072 MB of memory, your kernel is likely to crash on this example. If (instead) you use 5120 MB of memory when you launch your session, you should have sufficient memory to run these examples.
-====
-
-**Relevant topics:** function, subset
-
-.Items to submit
-====
-- R code used to solve this problem.
-- Modified `get_probability_of_review` with comments explaining each step.
-- 1-2 sentences explaining overall what the function is doing.
-- Number and name of arguments for the function, `get_probability_of_review`.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-We want people who use our function to be able to get results even if they don't provide a minimum rating value.
-
-Modify the function `get_probability_of_review` so `min_rating` has the default value of 0. Test your function as follows.
-
-[source,r]
-----
-get_probability_of_review(interactions_dataset = interactions, userID = 5000)
-----
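-
-If it helps, default values are declared right in the function signature. A generic sketch (unrelated to our dataset):
-
-[source,r]
-----
-greet <- function(name, greeting = "Hello") {
-  paste(greeting, name)
-}
-
-greet("Ana")           # uses the default: "Hello Ana"
-greet("Ana", "Howdy")  # overrides it:     "Howdy Ana"
-----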
-
-Now, in R (and in most languages), you can provide the arguments out of order, as long as you provide the argument name on the left of the equals sign and the value on the right. For example, the following will still work.
-
-[source,r]
-----
-get_probability_of_review(userID = 5000, interactions_dataset = interactions)
-----
-
-In addition, you don't have to provide the argument names when you call the function; however, you _do_ have to place the arguments in order when you do.
-
-[source,r]
-----
-get_probability_of_review(interactions, 5000)
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Our function may not be the most efficient. However, we _can_ reduce the code a little bit! Modify our function so we only use the `subset` function once, rather than 3 times.
-
-Test your modified function on userID 5000. Do you get the same results as above?
-
-Now, instead of using `subset`, just use regular old indexing in your function. Do your results agree with both versions above?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Run the code below. Explain what happens, and why it is happening.
-
-[source,r]
-----
-head(read_user_min_rating_data)
-----
-
-[TIP]
-====
-Google "Scoping in R", and read.
-====
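-
-A minimal sketch of what scoping means in practice (generic code, not tied to our function):
-
-[source,r]
-----
-f <- function() {
-  local_var <- 42   # created inside the function's environment
-  local_var
-}
-
-f()          # 42
-# local_var  # uncommenting this line would error: object 'local_var' not found
-----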
-
-.Items to submit
-====
-- The results of running the R code.
-- 1-2 sentences explaining what happened.
-- 1-2 sentences explaining why it is happening.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-++++
-
-++++
-
-Apply our function to the `interactions` dataset to get, for a sample of 10 users, the probability of reviewing books given that they liked the book.
-
-Save this probability to a vector called `prob_review`.
-
-To do so, determine a minimum rating (`min_rating`) value when calculating that probability. Provide 1-2 sentences explaining why you chose this value.
-
-[TIP]
-====
-You can use the function `sample` to get a random sample of 10 users.
-====
-
-[TIP]
-====
-You can pick any 10 users you want to compose your sample.
-====
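-
-For example, one possible way to draw the sample -- setting a seed so the sample is reproducible:
-
-[source,r]
-----
-set.seed(42)
-some_users <- sample(unique(interactions$user_id), 10)
-some_users
-----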
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining why you chose this particular minimum rating value.
-====
-
-=== Question 6
-
-Change the minimum rating value, and re-calculate the probability for your selected 10 users.
-
-Make 1 (or more) plot(s) to compare the results you got with the different minimum rating value. Write 1-2 sentences describing your findings.
-
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences comparing the results for question (5) and (6).
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project09.adoc
deleted file mode 100644
index 76bfdbb0a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project09.adoc
+++ /dev/null
@@ -1,167 +0,0 @@
-= STAT 19000: Project 9 -- Fall 2021
-:page-mathjax: true
-
-**Motivation:** A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!
-
-**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
-
-**Scope:** r, functions
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-- Comprehend what a function is, and the components of a function in R.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/election/*.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] has many applications, the most famous probably being fraud detection.
-
-[quote, wikipedia, 'https://en.wikipedia.org/wiki/Benford%27s_law']
-____
-Benford's law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30 % of the time, while 9 appears as the leading significant digit less than 5 % of the time. If the digits were distributed uniformly, they would each occur about 11.1 % of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
-____
-
-Benford's law is given by the equation below.
-
-$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$
-
-$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)
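-
-As a quick sanity check of the formula: the nine probabilities should sum to 1, and a leading 1 should come out to roughly 30%.
-
-[source,r]
-----
-# direct transcription of the formula above
-d <- 1:9
-p <- log((d + 1) / d) / log(10)
-round(p[1], 3)   # ~0.301, i.e. about 30% for a leading digit of 1
-sum(p)           # 1
-----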
-
-Create a function called `benfords_law` that takes the argument `digit`, and calculates the probability of `digit` being the starting digit of a random number based on Benford's law above.
-
-Consider `digit` to be a single value. Test your function on digit 7.
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- The results of running `benfords_law(7)`.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Let's make our function more user friendly. When creating functions, it is important to think about where you are going to use them, and whether other people may use them as well.
-
-Adding error catching statements can help make sure your function is not used out of context.
-
-Add error catching by creating an if statement that checks whether `digit` is between 1 and 9. If not, use the `stop` function to stop the function and return a message explaining the error and how the user can avoid it.
-
-Consider `digit` to be a single value. Test your new `benfords_law` function on digit 0.
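-
-Here is a generic sketch of the `stop` pattern (deliberately unrelated to Benford's law):
-
-[source,r]
-----
-safe_sqrt <- function(x) {
-  if (x < 0) {
-    stop("x must be non-negative")
-  }
-  sqrt(x)
-}
-
-safe_sqrt(9)    # 3
-# safe_sqrt(-1) # would halt with the message above
-----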
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- The results of running `benfords_law(0)`.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Our `benfords_law` function was created to calculate the value of a single digit. We have discussed in the past the advantages of having a vectorized function.
-
-Modify `benfords_law` to accept a vector of leading digits. Make sure `benfords_law` stops if any value in the vector `digit` is not between 1 and 9.
-
-Test your vectorized `benfords_law` using the following code.
-
-[source,r]
-----
-benfords_law(0:5)
-benfords_law(1:6)
-----
-
-[TIP]
-====
-There are many ways to solve this problem. You can use for loops, or the functions `sapply` or `Vectorize`. However, the simplest way may be to take a look at our `if` statement, as the function `log` is already vectorized -- see the sketch of `any` below.
-====
-
-**Relevant topics:** any
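-
-A quick illustration of `any` on a made-up vector:
-
-[source,r]
-----
-x <- c(2, 5, 11)
-any(x < 1 | x > 9)   # TRUE, because 11 falls outside 1-9
-----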
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Calculate Benford's law probabilities for all possible digits (1 to 9). Create a graph to illustrate the results. You can use a barplot, a lineplot, or a combination of both.
-
-Make sure you add a title to your plot, and play with the colors and aesthetics. Have fun!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Now that we have, and understand, the theoretical probabilities of Benford's law, how about we use them to try to find anomalies in the elections dataset?
-
-As we mentioned previously, Benford's law is very commonly used in fraud detection. Fraud detection algorithms look for anomalies in datasets based on certain criteria and flag them for audit or further exploration.
-
-Not every anomaly is fraud, but it _is_ a good start.
-
-We will continue this in our next project, but we can start to set things up.
-
-Create a function called `get_starting_digit` that has one argument, `transaction_vector`.
-
-The function should return a vector containing the starting digit for each value in the `transaction_vector`.
-
-For example, `get_starting_digit(c(10, 2, 500))` should return `c(1, 2, 5)`. Make sure that the result of `get_starting_digit` is a numeric vector.
-
-Test your function by running the following code.
-
-[source,r]
-----
-str(get_starting_digit(c(100,2,50,689,1)))
-----
-
-[TIP]
-====
-There are many ways to solve this question.
-====
-
-**Relevant topics:** as.numeric, substr
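-
-As a small illustration of these two functions on made-up values (not a full solution):
-
-[source,r]
-----
-vals <- c("100", "2", "50")
-as.numeric(substr(vals, 1, 1))   # 1 2 5
-----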
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project10.adoc
deleted file mode 100644
index 9032cc890..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project10.adoc
+++ /dev/null
@@ -1,183 +0,0 @@
-= STAT 19000: Project 10 -- Fall 2021
-
-**Motivation:** Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming]. In this project, we will learn to apply functions to entire vectors of data using `sapply`.
-
-**Context:** We've just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. `sapply` is one of the best ways to do this in R.
-
-**Scope:** r, sapply, functions
-
-.Learning Objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/election/*.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Read the elections dataset from 2014 (`itcont2014.txt`) into a data.frame called `elections` using the `fread` function from the `data.table` package.
-
-[TIP]
-====
-Make sure to use the correct argument `sep='|'` from the `fread` function.
-====
-
-Create a vector called `transactions_starting_digit` that gets the starting digit for each transaction value (use the `TRANSACTION_AMT` column). Be sure to use `get_starting_digit` function from the previous project.
-
-Take a look at the starting digits of the unique transaction amounts. Can we directly compare the results to Benford's law to look for anomalies? Explain why or why not, and if not, what do we need to do to be able to make the comparisons?
-
-[TIP]
-====
-Pay close attention to the results -- to be able to compare directly, the numbers you are testing would need to be _valid_ inputs for the Benford's law function.
-====
-
-[TIP]
-====
-What are the possible digits a number can start with?
-====
-
-**Relevant topics:** fread, unique, table
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining whether any changes are needed in our dataset to analyze it using Benford's law, and why or why not. If so, what changes are necessary?
-====
-
-=== Question 2
-
-[TIP]
-====
-Be sure to watch the video from Question 1. It covers Question 2 too.
-====
-
-If in question (1) you answered that there are modifications needed in the data, make the necessary modifications.
-
-[TIP]
-====
-You _should_ need to make a modification.
-====
-
-Make a barplot showing the percentage of times each digit was the starting digit.
-
-Include in your barplot a line indicating expected percentage based on Benford's law.
-
-If we compared our results to Benford's Law would we consider the findings anomalous? Explain why or why not.
-
-**Relevant topics:** barplot, lines, points, table, prop.table, indexing
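-
-A generic sketch of overlaying a reference line on a barplot, using made-up percentages (not the election data):
-
-[source,r]
-----
-observed <- c(`1` = 35, `2` = 18, `3` = 12)   # pretend observed percentages
-expected <- c(30.1, 17.6, 12.5)               # pretend expected percentages
-midpoints <- barplot(observed, ylim = c(0, 50), xlab = "Leading digit", ylab = "Percent")
-lines(midpoints, expected, type = "b", col = "red")
-----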
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining whether or not you think the results for this dataset are anomalous based on Benford's law, and why.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Let's explore things a bit more. How does a different grouping look? To facilitate our analysis, let's create a function to replicate the steps from questions (1) and (2).
-
-Create a function called `compare_to_benfords` that accepts two arguments, `values` and `title`. `values` represents a vector of values to analyze using Benford's Law, and `title` will provide the `title` of our resulting plot.
-
-Make sure the `title` argument has a default value, so that if we don't pass an argument to it, the function will still run.
-
-The function should get the starting digits in `values`, perform any necessary clean up, and compare the results with Benford's law graphically, by producing a plot like the one we made in question (2).
-
-Note that we are simplifying things by wrapping what we did in questions (1) and (2) into a function so we can do the analysis more efficiently.
-
-Test your function on the `TRANSACTION_AMT` column from the `elections` dataset. Note that the results should be the same as question (2) -- even the title of your plot.
-
-For fair comparison, set the y-axis limits to be between 0 and 50%.
-
-[TIP]
-====
-If you called either of the `benfords_law` or `get_starting_digit` functions _within_ your `compare_to_benfords` function, consider the following.
-
-What if you shared this function with your friend, who _didn't_ have access to your `benfords_law` or `get_starting_digit` functions? It wouldn't work!
-
-Instead, it is perfectly acceptable to _declare_ your functions _inside_ your `compare_to_benfords` function. These types of functions are called _helper_ functions.
-====
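-
-A generic sketch of the helper-function idea:
-
-[source,r]
-----
-outer_fn <- function(x) {
-  # helper only exists inside outer_fn, so outer_fn is fully self-contained
-  helper <- function(v) v / sum(v)
-  helper(x)
-}
-
-outer_fn(c(1, 2, 7))
-----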
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- The results of running `compare_to_benfords(elections$TRANSACTION_AMT)`.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Let's dig into the data a bit more. Using the `compare_to_benfords` function, analyze the transactions from the following entities (`ENTITY_TP`):
-
-- Candidate ('CAN'),
-- Individual - a person - ('IND'),
-- and Organization - not a committee and not a person - ('ORG').
-
-Use a loop, or one of the functions in the `apply` suite to solve this problem.
-
-Write 1-2 sentences comparing the transactions for each type of `ENTITY_TP`.
-
-Before running your code, run the following code to create a 1x3 grid for our plots (one panel per entity type).
-
-[source,r]
-----
-par(mfrow=c(1,3))
-----
-
-[TIP]
-====
-There are many ways to solve this problem.
-====
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- The results of running `compare_to_benfords` for each of the three entity types.
-- Optional: Include the name or abbreviation of the entity in its title.
-====
-
-=== Question 5
-
-Use the elections datasets and what you learned about Benford's law to explore the data more.
-
-You can compare specific states, donations to other entities, or even use datasets from other years.
-
-Explain what you are doing and why, and what your conclusions are. Be creative!
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining what and why you are doing.
-- 1-2 sentences explaining your conclusions.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project11.adoc
deleted file mode 100644
index 119f2a4ea..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project11.adoc
+++ /dev/null
@@ -1,200 +0,0 @@
-= STAT 19000: Project 11 -- Fall 2021
-
-**Motivation:** The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you've learned so far this semester to solve data-driven problems. In previous projects, we've directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you'd like.
-
-**Context:** You've learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of YouTube data.
-
-**Scope:** R
-
-.Learning Objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Comprehend what a function is, and the components of a function in R.
-- Demonstrate the ability to use nested apply functions to solve a data-driven problem.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/youtube/*`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-In project 5, we used a loop to combine the various countries' YouTube datasets into a single dataset called `yt` (for YouTube).
-
-Now, we've provided you with code below to create such a dataset, with a subset of the countries we want to look at.
-
-[source,r]
-----
-library(lubridate)
-
-countries <- c('US', 'DE', 'CA', 'FR')
-
-# Choose either the for loop or the lapply approach for creating `yt`
-
-# EITHER use a for loop to create the data frame `yt`
-yt <- data.frame()
-for (c in countries) {
- filename <- paste0("/depot/datamine/data/youtube/", c, "videos.csv")
- dat <- read.csv(filename)
- dat$country_code <- c
- yt <- rbind(yt, dat)
-}
-
-# OR use lapply (plus do.call) to create the data frame `yt`
-myDFlist <- lapply( countries, function(c) {
- dat <- read.csv(paste0("/depot/datamine/data/youtube/", c, "videos.csv"))
- dat$country_code <- c
- return(dat)} )
-yt <- do.call(rbind, myDFlist)
-
-# convert columns to date formats
-yt$trending_date <- ydm(yt$trending_date)
-yt$publish_time <- ymd_hms(yt$publish_time)
-
-# extract the trending_year and publish_year
-yt$trending_year <- year(yt$trending_date)
-yt$publish_year <- year(yt$publish_time)
-----
-
-Take a look at the `tags` column in our `yt` dataset. Create a function called `count_tags` that has an argument called `tag_vector`. Your `count_tags` function should return the count of how many unique tags the vector `tag_vector` contains.
-
-[TIP]
-====
-Take a look at the `fixed` argument in `strsplit`.
-====
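-
-A sketch of `strsplit` with `fixed = TRUE` on a made-up tag string (check the actual separator used in the `tags` column first):
-
-[source,r]
-----
-tags <- 'funny|"cat videos"|music'                   # hypothetical example value
-split_tags <- strsplit(tags, "|", fixed = TRUE)[[1]]
-length(unique(split_tags))                           # 3
-----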
-
-You can test your function with the following code.
-
-[source,r]
-----
-tag_test <- yt$tags[2]
-tag_test
-count_tags(tag_test)
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Create a new column in your `yt` dataset called `n_tags` that contains the number of tags for the corresponding trending video.
-
-Make sure to use your `count_tags` function. Which YouTube trending video has the highest number of unique tags for videos that are trending either in the US or Germany (DE)? How many tags does it have?
-
-[TIP]
-====
-Make sure to use the `USE.NAMES` argument of the `sapply` function.
-====
-
-[TIP]
-====
-Begin by creating the new column `n_tags`. Then create a new dataset only for YouTube videos trending in 'US' or 'DE'. For the subsetted dataset, get the YouTube trending video with the highest number of tags.
-====
-
-[TIP]
-====
-It should be `video_id` with value 4AelFaljd7k.
-====
-
-**Relevant topics:** sapply, which.max, indexing, subset
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The title of the YouTube video with the highest number of tags, and the number of tags it has.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Is there an association between number of tags in a video and how many views it gets?
-
-Make a scatterplot with the number of `views` on the x-axis and the number of tags (`n_tags`) on the y-axis. Based on your plot, write 1-2 sentences about whether you think the number of tags and the number of views are associated or not.
-
-Hmmm, is a scatterplot a good choice for seeing an association in this case? If so, explain why. If not, create a better plot for determining this, explain why your plot is better, and describe any association you see.
-
-[TIP]
-====
-`tapply` could be useful for the follow up question.
-====
-
-**Relevant topics:** sapply, which.max, indexing, subset
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining if you think number of views and number of tags a youtube video has are associated or not, and why.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Compare the average number of views and average number of comments that the YouTube trending videos have _per trending country_.
-
-Is there a different behavior between countries? Are the comparisons fair? To check whether we are being fair, take a look at how many YouTube trending videos we have per country.
-
-**Relevant topics:** tapply, mean
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences comparing trending countries based on average number of views and comments.
-- 1-2 sentences explaining if you think we are being fair in our comparisons, and why or why not.
-====
-
-=== Question 5
-
-How would you compare the YouTube trending videos across the different countries?
-
-Make a comparison using plots and/or summary statistics. Explain what variables you are looking at, and why you are analyzing the data the way you are. Have fun with it!
-
-[NOTE]
-====
-There are no right/wrong answers here. Just dig in a little bit and see what you can find.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining your logic.
-- 1-2 sentences comparing the countries.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project12.adoc
deleted file mode 100644
index 62812ceeb..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project12.adoc
+++ /dev/null
@@ -1,191 +0,0 @@
-= STAT 19000: Project 12 -- Fall 2021
-
-**Motivation:** In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. `lubridate` is a package within the famous https://www.tidyverse.org/[tidyverse], that greatly simplifies some of the most common tasks one needs to perform with date data.
-
-**Context:** We've been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a https://www.tidyverse.org/[tidyverse] package that adds great stand-alone value when working with dates.
-
-**Scope:** r
-
-.Learning Objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Incorporate legends using legend().
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-- Convert strings to dates, and format dates using the `lubridate` package.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-[WARNING]
-====
-For this project, when launching your Jupyter Lab instance, please select 5000 MB as the amount of memory to allocate.
-====
-
-Read the dataset into a dataframe called `liquor`.
-
-We are interested in exploring time-related trends in Iowa liquor sales. What is the data type for the column `Date`?
-
-Try to run the following code, to get the time between the first and second sale.
-
-[source,r]
-----
-liquor$Date[1] - liquor$Date[2]
-----
-
-As you may have expected, we cannot use the standard operators (like + and -) on this type.
-
-Create a new column named `date` to be the `Date` column but in date format using the function `as.Date()`.
-
-[IMPORTANT]
-====
-From this point in time on, you will have 2 "date" columns -- 1 called `Date` and 1 called `date`. `Date` will be the incorrect type for a date, and `date` will be the correct type.
-
-This allows us to see different ways to work with the data.
-====
-
-You may need to define the date format in the `as.Date()` function using the argument `format`.
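-
-For example -- assuming (hypothetically) that the raw strings look like `"11/01/2015"` in month/day/year order; check the actual data first:
-
-[source,r]
-----
-# %m = month, %d = day, %Y = 4-digit year
-as.Date("11/01/2015", format = "%m/%d/%Y")
-----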
-
-Try running the following code now.
-
-[source,r]
-----
-liquor$date[1] - liquor$date[2]
-----
-
-Much better! This is just 1 reason why it is important to have the data in your dataframe be of the correct type.
-
-[TIP]
-====
-Double check that the date got converted properly. The year for `liquor$date[1]` should be in 2015.
-====
-
-**Relevant topics:** `read.csv`, `fread`, `as.Date`, `str`
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Create two new columns in the dataset called `year` and `month` based on the `Date` column.
-
-Which years are covered in this dataset regarding Iowa liquor sales? Do all years have all months represented?
-
-Use the `as.Date` function again, and set the format to contain only the information wanted. See an example below.
-
-[IMPORTANT]
-====
-**Update:** It came to our attention that the `substr` method previously mentioned is _much_ less memory efficient and will cause the kernel to crash (if your project writer took the time to test _both_ ideas he had, you wouldn't have had this issue (sorry)). Please use the `as.Date` method shown below.
-====
-
-[source,r]
-----
-myDate <- as.Date('2021-11-01')
-day <- as.numeric(format(myDate,'%d'))
-----
-
-**Relevant topics:** `substr`, `as.numeric`, `format`, `unique`, `table`
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-A useful package for dealing with dates is called `lubridate`. The package is part of the famous `tidyverse` suite of packages. Run the code below to load it.
-
-[source,r]
-----
-library(lubridate)
-----
-
-Re-do questions 1 and 2 using the `lubridate` package. Make sure to name the columns differently, for example `date_lb`, `year_lb` and `month_lb`.
-
-Do you have a preference for solving the questions? Why or why not?
-
-**Relevant topics:** https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf[Lubridate Cheat Sheet]
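-
-A minimal sketch of the `lubridate` equivalents, again assuming month/day/year strings (verify against the real `Date` column):
-
-[source,r]
-----
-library(lubridate)
-
-d <- mdy("11/01/2015")   # hypothetical example value
-year(d)                  # 2015
-month(d)                 # 11
-----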
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Sentence explaining which method you prefer and why.
-====
-
-=== Question 4
-
-Now that we have the columns `year` and `month`, let's explore the data for time trends.
-
-What is the average volume (gallons) of liquor sold per month? Which month has the lowest average volume? Does that surprise you?
-
-[TIP]
-====
-You can change the labels in the x-axis to be months by having the argument `xaxt` in the plot function set as "n" (`xaxt="n"`) and then having the following code at the end of your plot: `axis(side=1, at=1:12, labels=month.abb)`.
-====
-
-**Relevant topics:** `tapply`, `plot`
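-
-A generic sketch of the axis-labeling tip above, using stand-in values rather than the liquor data:
-
-[source,r]
-----
-set.seed(1)
-monthly_avg <- runif(12, 500, 1500)   # stand-in monthly averages
-plot(1:12, monthly_avg, type = "b", xaxt = "n",
-     xlab = "Month", ylab = "Average volume (gallons)")
-axis(side = 1, at = 1:12, labels = month.abb)
-----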
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences describing your findings.
-====
-
-=== Question 5
-
-Make a line plot for the average volume sold per month for the years of 2012 to 2015. Your plot should contain 4 lines, one for each year.
-
-Make sure you specify a title, and label your axes.
-
-Write 1-2 sentences analyzing your plot.
-
-[TIP]
-====
-There are many ways to get an average per month. You can use `for` loops, the `apply` suite with your own function, `subset`, or `tapply` with a grouping that involves both year and month.
-====
-
-**Relevant topics:** `plot`, `line`, `subset`, `mean`, `sapply`, `tapply`
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences analyzing your plot.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project13.adoc
deleted file mode 100644
index 72ebb31ac..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project13.adoc
+++ /dev/null
@@ -1,210 +0,0 @@
-= STAT 19000: Project 13 -- Fall 2021
-
-**Motivation:** It is always important to stay fresh and continue to hone and improve your skills. For example, games and events like https://adventofcode.com/[https://adventofcode.com/] are a great way to keep thinking and learning. Plus, you can solve the puzzles with any language you want! It can be a fun way to learn a new programming language.
-
-[quote, James Baker]
-____
-Proper Preparation Prevents Poor Performance.
-____
-
-In this project we will continue to wade through data, with a special focus on the apply suite of functions, building your own functions, and graphics.
-
-**Context:** This is the _last_ project of the semester! Many of you will have already finished your 10 projects, but for those who have not, this should be a fun and straightforward way to keep practicing.
-
-**Scope:** r
-
-.Learning Objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Incorporate legends using legend().
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-- Convert strings to dates, and format dates using the lubridate package.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-Run the lines of code below from project (12) to read the data and format the `year` and `month`.
-
-[source,r]
-----
-library(data.table)
-library(lubridate)
-
-liquor <- fread('/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt')
-liquor$date <- mdy(liquor$Date)
-liquor$year <- year(liquor$date)
-liquor$month <- month(liquor$date)
-----
-
-Run the code below to get a better understanding of the columns `State Bottle Cost` and `State Bottle Retail`.
-
-[source,r]
-----
-head(liquor[,c("State Bottle Cost", "State Bottle Retail")])
-typeof(liquor$`State Bottle Cost`)
-typeof(liquor$`State Bottle Retail`)
-----
-
-Create two new columns, `cost` and `retail`, to be `numeric` versions of `State Bottle Cost` and `State Bottle Retail`, respectively.
-
-Once you have those two new columns, create a column called `profit` that is the profit for each sale. Which sale had the highest profit?
-
-[TIP]
-====
-There are many ways to solve the question. _Relevant topics_ contains functions to use in some possible solutions.
-====
-
-**Relevant topics:** gsub, substr, nchar, as.numeric, which.max
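-
-A sketch of one possible approach on made-up values that look like `"$12.50"` (verify the actual formatting in the data):
-
-[source,r]
-----
-x <- c("$12.50", "$8.99")
-as.numeric(gsub("$", "", x, fixed = TRUE))   # fixed = TRUE treats "$" literally, not as a regex anchor
-----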
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The date, vendor name, number of bottles sold and profit for the sale with the highest profit.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-++++
-
-++++
-
-We want to provide useful information based on a `Vendor Number` to help in the decision making process.
-
-Create a function called `createDashboard` that takes two arguments -- a specific `Vendor Number` and the `liquor` data frame -- and produces a plot of the average profit per year for that `Vendor Number`.
-
-**Relevant topics:** tapply, plot, mean
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The results of running `createDashboard(255, liquor)`.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Modify your `createDashboard` function so that it uses the `liquor` data frame as the default value, in case the user forgets to give the name of a data frame as input to the function.
-
-We are going to start adding additional plots to your function. Run the code below first, before you run the code to build your plots. This will organize many plots in a single plot.
-
-[source,r]
-----
-par(mfrow=c(1, 2))
-----
-
-Note that we are creating a dashboard in this question with 1 row and 2 columns.
-
-Add a bar plot to your dashboard that shows the total volume sold using `Bottle Volume (ml)`.
-
-Make sure to add titles to your plots.
-
-**Relevant topics:** table, barplot
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The results of running `createDashboard(255)`.
-====
-
-=== Question 4
-
-Modify the `par(mfrow=c(1, 2))` call to be `par(mfrow=c(2, 2))` so we can fit 2 more plots in our dashboard.
-
-Create a plot that shows the average number of bottles sold per month.
-
-**Optional:** Modify the argument `mar` in `par()` to reduce the margins between the plots in our dashboard.
-
-**Relevant topics:** tapply, plot, mean
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The results of running `createDashboard(255)`.
-====
-
-=== Question 5
-
-Add a plot to complete our dashboard. Write 1-2 sentences explaining why you chose the plot in question.
-
-**Optional:** Add, remove, and/or modify the dashboard to contain information you find relevant. Make sure to document why you are making the changes.
-
-**Relevant topics:** tapply, plot, mean
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The results of running `createDashboard(255)`.
-====
-
-=== Question 6 (optional, 0 pts)
-
-`patchwork` is a very cool R package that makes for a simple and intuitive way to combine many ggplot plots into a single graphic. See https://patchwork.data-imaginist.com/[here] for details.
-
-Re-write your function `createDashboard` to use `patchwork` and `ggplot`.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7 (optional, 0 pts)
-
-Use your `createDashboard` function to compare 2 vendors. You can print the dashboard into a pdf using the code below.
-
-[source,r]
-----
-pdf(file = "myFilename.pdf", # The directory and name you want to save the file in
- width = 8, # The width of the plot in inches
- height = 8) # The height of the plot in inches
-
-createDashboard(255)
-
-dev.off()
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-projects.adoc
deleted file mode 100644
index fd021e64f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-projects.adoc
+++ /dev/null
@@ -1,59 +0,0 @@
-= STAT 19000
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-* xref:fall2021/19000/19000-f2021-officehours.adoc[STAT 19000 Office Hours for Fall 2021]
-* xref:fall2021/19000/19000-f2021-project01.adoc[Project 1: Getting acquainted with Jupyter Lab]
-* xref:fall2021/19000/19000-f2021-project02.adoc[Project 2: Introduction to R: part I]
-* xref:fall2021/19000/19000-f2021-project03.adoc[Project 3: Introduction to R: part II]
-* xref:fall2021/19000/19000-f2021-project04.adoc[Project 4: Control flow in R]
-* xref:fall2021/19000/19000-f2021-project05.adoc[Project 5: Vectorized operations in R]
-* xref:fall2021/19000/19000-f2021-project06.adoc[Project 6: Tapply]
-* xref:fall2021/19000/19000-f2021-project07.adoc[Project 7: Base R functions]
-* xref:fall2021/19000/19000-f2021-project08.adoc[Project 8: Functions in R: part I]
-* xref:fall2021/19000/19000-f2021-project09.adoc[Project 9: Functions in R: part II]
-* xref:fall2021/19000/19000-f2021-project10.adoc[Project 10: Lists & Sapply]
-* xref:fall2021/19000/19000-f2021-project11.adoc[Project 11: Review: Focus on Sapply]
-* xref:fall2021/19000/19000-f2021-project12.adoc[Project 12: Review: Focus on basic graphics]
-* xref:fall2021/19000/19000-f2021-project13.adoc[Project 13: Review: Focus on apply suite]
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. After submitting your project in Gradescope, you will be able to download the project to verify that the content you submitted is what the graders will see. You will **not** get credit for or be able to re-submit your work if you accidentally uploaded the wrong project, or anything else. It is your responsibility to ensure that you are uploading the correct content.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2021/stat19000
-
-=== Link
-
-https://piazza.com/purdue/fall2021/stat19000/home
-
-== Syllabus
-
-++++
-include::book:ROOT:partial$syllabus.adoc[]
-++++
-
-== Office hour schedule
-
-++++
-include::book:ROOT:partial$office-hour-schedule.adoc[]
-++++
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project01.adoc
deleted file mode 100644
index 3d9b1b754..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project01.adoc
+++ /dev/null
@@ -1,207 +0,0 @@
-= STAT 29000: Project 1 -- Fall 2021
-
-== Mark~it~down, your first project back in The Data Mine
-
-**Motivation:** It's been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like `sort`, `awk`, `grep`, and `sed`. While Python and R are extremely powerful tools that can solve many problems -- they aren't always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using `sqlite3` and `MySQL/MariaDB`.
-
-We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, do a little review, and prepare for the rest of the semester.
-
-**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester.
-
-**Scope:** Jupyter Lab, R, Python, scholar, brown, markdown
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Scholar and Brown.
-- Review R and Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-`/depot/datamine/data/`
-
-== Questions
-
-=== Question 1
-
-In previous semesters, we've used a program called RStudio Server to run R code on Scholar and solve the projects. This year, we will be using Jupyter Lab almost exclusively. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate and log in to https://ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen with lots of options. Don't worry, however, the next steps are very straightforward.
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-In the not-too-distant future, we will be using _both_ Scholar (https://gateway.scholar.rcac.purdue.edu) _and_ Brown (https://ondemand.brown.rcac.purdue.edu) to launch Jupyter Lab instances. For now, however, we will be using Brown.
-====
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says btn:[Jupyter Lab], click on btn:[Jupyter Lab].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand uses SLURM to launch a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 MB of memory. It is OK to not understand what that means yet; we will learn more about this in STAT 39000. For the curious, however, if you were to open a terminal session on Scholar and/or Brown and run the following, you would see your job queued up.
-
-[source,bash]
-----
-squeue -u username # replace 'username' with your username
-----
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2021-s2022::
-The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-f2021-s2022-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2021-s2022 kernel. Click on btn:[f2021-s2022], and a fresh notebook will be created for you.
-
-Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node you are running on).
-====
-
-=== Question 2
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Scholar and Brown at `/depot/datamine/apps/templates/project_template.ipynb`).
-
-++++
-
-++++
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many of each type of cell are there in the default template?
-====
-
-=== Question 3
-
-Last year, while using RStudio, you probably gained a certain amount of experience using RMarkdown -- a flavor of Markdown that allows you to embed and run code in Markdown. Jupyter Lab, while very different in many ways, still uses Markdown to add formatted text to a given notebook. It is well worth the small time investment to learn how to use Markdown, and create a neat and reproducible document.
-
-++++
-
-++++
-
-Create a Markdown cell in your notebook. Create both an _ordered_ and _unordered_ list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another _ordered_ list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, **embolden** at least 1 item in your list, _italicize_ at least 1 item in your list, and make at least 1 item in your list formatted like `code`.
-
-[TIP]
-====
-You can quickly get started with Markdown using this cheat sheet: https://www.markdownguide.org/cheat-sheet/
-====
-
-[TIP]
-====
-Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered.
-====
-
-[TIP]
-====
-If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don't want to do that, feel free to use the `%%markdown` magic to create a markdown cell without _really_ creating a markdown cell. Any cell that starts with `%%markdown` in the first line will generate markdown when run.
-====
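-
-If you do use the `%%markdown` magic, a minimal sketch of such a cell could look like the following -- the list items and link here are just placeholders, so substitute your own interests.
-
-[source,ipython]
-----
-%%markdown
-
-Favorite academic interests (unordered):
-
-- **machine learning**
-- _operating systems_
-- `forensic accounting`
-
-Ranked (ordered):
-
-1. machine learning
-2. operating systems
-3. forensic accounting
-
-[A markdown cheat sheet](https://www.markdownguide.org/cheat-sheet/)
-----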
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum):
-
-- A header for this section (your choice of size) that says "About".
-+
-[TIP]
-====
-A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`.
-====
-+
-- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn.
-- In the about section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Read xref:templates.adoc[the templates page] and learn how to run snippets of code in Jupyter Lab _other than_ Python. Run at least 1 example of Python, R, SQL, and bash. For SQL and bash, you can use the following snippets of code to make sure things are working properly.
-
-++++
-
-++++
-
-[source, sql]
-----
--- Use the following sqlite database: /depot/datamine/data/movies_and_tv/imdb.db
-SELECT * FROM titles LIMIT 5;
-----
-
-[source,bash]
-----
-ls -la /depot/datamine/data/movies_and_tv/
-----
-
-For your R and Python code, use this as an opportunity to review your skills. For each language, choose at least 1 dataset from `/depot/datamine/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis, for each language.
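-
-If you choose to write your R code in the `f2021-s2022` kernel, a cell using the `%%R` magic (mentioned earlier) might look something like the following minimal sketch. The function, data, and plot here are placeholders -- swap in a dataset of your choosing from `/depot/datamine/data`.
-
-[source,ipython]
-----
-%%R
-
-# placeholder custom function: basic summary statistics for a numeric vector
-my_summary <- function(x) {
-    c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
-}
-
-# placeholder data -- replace with a column from your chosen dataset
-my_values <- rnorm(100)
-print(my_summary(my_values))
-hist(my_values, main = "Placeholder histogram")
-----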
-
-[TIP]
-====
-You could answer _any_ question you have about your dataset. This is an open-ended question -- just make sure you put in a good amount of effort. Low/no-effort solutions will not receive full credit.
-====
-
-[IMPORTANT]
-====
-Once done, submit your projects just like last year. See the xref:submissions.adoc[submissions page] for more details.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentence analysis for each of your R and Python code examples.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project02.adoc
deleted file mode 100644
index a7aa14149..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project02.adoc
+++ /dev/null
@@ -1,418 +0,0 @@
-= STAT 29000: Project 2 -- Fall 2021
-
-== Navigating UNIX and using `bash`
-
-**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook.
-
-**Context:** At this point in time, our new Jupyter Lab system, using https://gateway.scholar.rcac.purdue.edu and https://gateway.brown.rcac.purdue.edu, is very new to everyone. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab.
-
-**Scope:** bash, Jupyter Lab
-
-.Learning Objectives
-****
-- Distinguish differences in `/home`, `/scratch`, `/class`, and `/depot`.
-- Navigating UNIX via a terminal: `ls`, `pwd`, `cd`, `.`, `..`, `~`, etc.
-- Analyzing files in a UNIX filesystem: `wc`, `du`, `cat`, `head`, `tail`, etc.
-- Creating and destroying files and folders in UNIX: `scp`, `rm`, `touch`, `cp`, `mv`, `mkdir`, `rmdir`, etc.
-- Use `man` to read and learn about UNIX utilities.
-- Run `bash` commands from within Jupyter Lab.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-`/depot/datamine/data/`
-
-== Questions
-
-[IMPORTANT]
-====
-If you are not a `bash` user and you use an alternative shell like `zsh` or `tcsh`, you will want to switch to `bash` for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the `%%bash` magic will use `/bin/bash` rather than your default shell, so you will not need to do anything.
-====
-
-[NOTE]
-====
-While it is not _super_ common for us to push a lot of external reading at you (other than the occasional blog post or article), https://learning.oreilly.com/library/view/learning-the-unix/0596002610[this] is an excellent and _very_ short resource to get you started using a UNIX-like system. We strongly recommend reading chapters: 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8.
-====
-
-=== Question 1
-
-++++
-
-++++
-
-Let's ease into this project by taking some time to adjust, to your liking, the environment you will be using for the entire semester. Begin by launching your Jupyter Lab session from either https://gateway.scholar.rcac.purdue.edu or https://gateway.brown.rcac.purdue.edu.
-
-Explore the settings, make at least 2 modifications to your environment, and list what you've changed.
-
-Here are some settings Kevin likes:
-
-- menu:Settings[JupyterLab Theme > JupyterLab Dark]
-- menu:Settings[Text Editor Theme > material]
-- menu:Settings[Text Editor Key Map > vim]
-- menu:Settings[Terminal Theme > Dark]
-- menu:Settings[Advanced Settings Editor > Notebook > codeCellConfig > lineNumbers > true]
-- menu:Settings[Advanced Settings Editor > Notebook > kernelShutdown > true]
-- menu:Settings[Advanced Settings Editor > Notebook > codeCellConfig > fontSize > 16]
-
-Dr. Ward does not like to customize his own environment, but he _does_ use the Emacs key bindings.
-
-- menu:Settings[Text Editor Key Map > emacs]
-
-[IMPORTANT]
-====
-Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc.
-====
-
-.Items to submit
-====
-- List (using a markdown cell) of the modifications you made to your environment.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-In the previous project, we used the `ls` command to list the contents of a directory as an example of running bash code using the `f2021-s2022` kernel. Aside from using the `%%bash` magic from the previous project, there are 2 more straightforward ways to run bash code from within Jupyter Lab.
-
-The first method allows you to run a bash command from within the same cell as your Python code, by prefixing the command with `!`. For example.
-
-[source,ipython]
-----
-!ls
-
-import pandas as pd
-myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
-myDF.head()
-----
-
-The second is to open a new terminal session. To do this, go to menu:File[New > Terminal]. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, `man`.
-
-[source,bash]
-----
-# man is short for manual
-# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down.
-man man
-----
-
-What is the _absolute path_ of the default directory of your `bash` shell?
-
-**Relevant topics:** xref:book:unix:pwd.adoc[pwd]
-
-.Items to submit
-====
-- The full filepath of the default directory (home directory). Ex: Kevin's is: `/home/kamstut`.
-- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-It is critical to be able to navigate a UNIX-like operating system. It is more likely than not that you will need to use a UNIX-like system at some point in your career. Perform the following actions, in order, using the `bash` shell.
-
-[NOTE]
-====
-I would recommend using a code cell with the magic `%%bash` to make sure that you are using the correct shell, and so your work is automatically saved.
-====
-
-. Write a command to navigate to the directory containing the datasets used in this course: `/depot/datamine/data/`.
-. Print the current working directory -- is the result what you expected? Output the `$PWD` variable using the `echo` command.
-. List the files within the current working directory (excluding subfiles).
-. Without navigating out of `/depot/datamine/data/`, list _all_ of the files within the `movies_and_tv` directory, _including_ hidden files.
-. Return to your home directory.
-. Write a command to confirm that you are back in the appropriate directory.
-
-[NOTE]
-====
-`/` is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/kamstut` is the _absolute path_ of Kevin's home directory. There is a folder called `home` inside the root `/` directory. Inside `home` is another folder named `kamstut`, which is Kevin's home directory.
-====
-
-**Relevant topics:** xref:book:unix:pwd.adoc[pwd], xref:book:unix:cd.adoc[cd], xref:book:unix:echo.adoc[echo], xref:book:unix:ls.adoc[ls]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-When running the `ls` command, you may have noticed two oddities that appeared in the output: "." and "..". `.` represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the `/depot/datamine/data` directory, the `.` refers to the `/depot/datamine/data` directory. If you are running the following bash command, the `.` is redundant and refers to the `/depot/datamine/data/yelp` directory.
-
-[source,bash]
-----
-ls -la /depot/datamine/data/yelp/.
-----
-
-`..` represents the parent directory, relative to the rest of the path. For example, if you are in the `/depot/datamine/data` directory, the `..` refers to the parent directory, `/depot/datamine`.
-
-Any path that contains either `.` or `..` is called a _relative path_. Any path that contains the entire path, starting from the root directory, `/`, is called an _absolute path_.
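-
-For example, the following two commands (shown here only as a sketch) both move you from `/depot/datamine/data` up to `/depot/datamine` -- one uses an absolute path, the other a relative path.
-
-[source,bash]
-----
-# absolute path
-cd /depot/datamine
-
-# relative path -- assumes the current working directory is /depot/datamine/data
-cd ..
-----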
-
-. Write a single command to navigate to our modulefiles directory: `/depot/datamine/opt/modulefiles`
-. Write a single command to navigate back to your home directory, however, rather than using `cd`, `cd ~`, or `cd $HOME` without the path argument, use `cd` and a _relative_ path.
-
-**Relevant topics:** xref:book:unix:pwd.adoc[pwd], xref:book:unix:cd.adoc[cd], xref:book:unix:special-symbols.adoc[. & .. & ~]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Your `$HOME` directory is your default directory. You can navigate to your `$HOME` directory using any of the following commands.
-
-[source,bash]
-----
-cd
-cd ~
-cd $HOME
-cd /home/$USER
-----
-
-This is typically where you will work, and where you will store your work (for instance, your completed projects). At the time of writing this, the `$HOME` directories on Brown and Scholar are **not** synced. What this means is, files you create on one cluster _will not_ be available on the other cluster. To move files between clusters, you will need to copy them using `scp` or `rsync`.
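-
-For example, copying a notebook from your home directory on one cluster to the other might look something like the following sketch, run from a terminal on the source cluster. The filename is a placeholder, and you should double check the destination hostname against the RCAC documentation.
-
-[source,bash]
-----
-# copy my-notebook.ipynb to your home directory on Brown (hostname assumed here)
-scp ~/my-notebook.ipynb $USER@brown.rcac.purdue.edu:~/
-----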
-
-[NOTE]
-====
-`$HOME` and `$USER` are environment variables. You can see what they are by typing `echo $HOME` and `echo $USER`. Environment variables are variables that are set by the system, or by the user. To get a list of your terminal session's environment variables, type `env`.
-====
-
-The `depot` space is a network file system (as is the `home` space, albeit on a different system). It is attached to the root directory on all of the nodes in the cluster. One convenience this provides is that files in this space exist everywhere the filesystem is mounted. In summary, files added anywhere in `/depot/datamine` will be available on _both_ Scholar and Brown. Although you will not utilize this space _very_ often (other than to access project datasets), this is good information to know.
-
-There exists 1 more important location on each cluster, `scratch`. Your `scratch` directory is located in the same place on either cluster: `/scratch/$RCAC_CLUSTER/$USER`. `scratch` is meant for use with _really_ large chunks of data. The quota on Brown is 200TB and 2 million files. The quota on Scholar is 1TB and 2 million files. You can see your quota and usage on each system by running the following command.
-
-[source,bash]
-----
-myquota
-----
-
-[TIP]
-====
-`$RCAC_CLUSTER` and `$USER` are environment variables. You can see what they are by typing `echo $RCAC_CLUSTER` and `echo $USER`. `$RCAC_CLUSTER` contains the name of the cluster (for this course, "scholar" or "brown"), and `$USER` contains the username of the current user.
-====
-
-. Navigate to your `scratch` directory.
-. Confirm you are in the correct location using a command.
-. Execute the `tokei` command, with input `~dgc/bin`.
-+
-[NOTE]
-====
-Doug Crabill is the compute wizard for the Statistics department here at Purdue. `~dgc/bin` is a directory he has made publicly available with a variety of useful scripts.
-====
-+
-. Output the first 5 lines and last 5 lines of `~dgc/bin/union`.
-. Count the number of lines in the bash script `~dgc/bin/union` (using a UNIX command).
-. How many bytes is the script?
-+
-[CAUTION]
-====
-Be careful. We want the size of the script, not the disk usage.
-====
-+
-. Find the location of the `tokei` command.
-
-[TIP]
-====
-When you type `myquota` on Scholar or Brown there are sometimes warnings about xauth. If you get a warning that looks something like the following, you can safely ignore it.
-
-[quote, , Scholar/Brown]
-____
-Warning: untrusted X11 forwarding setup failed: xauth key data not generated
-____
-====
-
-[TIP]
-====
-Commands often have _options_. _Options_ are switches or flags that modify a command's behavior. You can see the options of a command in the DESCRIPTION section of the man pages.
-
-[source,bash]
-----
-man wc
-----
-
-You can see -m, -l, and -w are all options for `wc`. Then, to test the options out, you can try the following examples.
-
-[source,bash]
-----
-# using the default wc command. "/depot/datamine/data/flights/1987.csv" is the first "argument" given to the command.
-wc /depot/datamine/data/flights/1987.csv
-
-# to count the lines, use the -l option
-wc -l /depot/datamine/data/flights/1987.csv
-
-# to count the words, use the -w option
-wc -w /depot/datamine/data/flights/1987.csv
-
-# you can combine options as well
-wc -w -l /depot/datamine/data/flights/1987.csv
-
-# some people like to use a single tack `-`
-wc -wl /depot/datamine/data/flights/1987.csv
-
-# order doesn't matter
-wc -lw /depot/datamine/data/flights/1987.csv
-----
-====
-
-**Relevant topics:** xref:book:unix:pwd.adoc[pwd], xref:book:unix:cd.adoc[cd], xref:book:unix:head.adoc[head], xref:book:unix:tail.adoc[tail], xref:book:unix:wc.adoc[wc], xref:book:unix:du.adoc[du], xref:book:unix:which.adoc[which], xref:book:unix:type.adoc[type]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-++++
-
-++++
-
-Perform the following operations.
-
-. Navigate to your scratch directory.
-. Copy the following file to your current working directory: `/depot/datamine/data/movies_and_tv/imdb.db`.
-. Create a new directory called `movies_and_tv` in your current working directory.
-. Move the file, `imdb.db`, from your scratch directory to the newly created `movies_and_tv` directory (inside of scratch).
-. Use `touch` to create a new, empty file called `im_empty.txt` in your scratch directory.
-. Remove the directory, `movies_and_tv`, from your scratch directory, including _all_ of the contents.
-. Remove the file, `im_empty.txt`, from your scratch directory.
-
-**Relevant topics:** xref:book:unix:cp.adoc[cp], xref:book:unix:rm.adoc[rm], xref:book:unix:touch.adoc[touch], xref:book:unix:cd.adoc[cd]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-This question should be performed by opening a terminal window. menu:File[New > Terminal]. Enter the result/content in a markdown cell in your notebook.
-====
-
-Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a _really_ useful feature that you may not know is there unless you are told!
-
-Here is the way it works, in the most common case -- using `cd`. Have a destination in mind, for example `/depot/datamine/data/flights/`. Type `cd /depot/d`, and press tab. You should be presented with a large list of options starting with `d`. Type `a`, then press tab, and you will be presented with an even smaller list. This time, press tab repeatedly until you've selected `datamine`. You can then continue to type and press tab as needed.
-
-Below is an image of the absolute path of a file in the Data Depot. Use `cat` and tab completion to print the contents of that file.
-
-image::figure03.webp[Tab completion, width=792, height=250, loading=lazy, title="Tab completion"]
-
-.Items to submit
-====
-- The content of the file, `hello_there.txt`, in a markdown cell in your notebook.
-====
-
-=== Question 8 (optional, 0 pts, but recommended)
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-For this question, you will most likely want to launch a terminal. To launch a terminal click on menu:File[New > Terminal]. No need to input this question in your notebook.
-====
-
-. Use `vim`, `emacs`, or `nano` to create a new file in your scratch directory called `im_still_here.sh`. Add the following contents to the file, save, and close it.
-+
-[source,bash]
-----
-#!/bin/bash
-
-i=0
-
-while true
-do
- echo "I'm still here! Count: $i"
- sleep 1
- ((i+=1))
-done
-----
-+
-. Confirm the contents of the file using `cat`.
-. Try and run the program by typing `im_still_here.sh`.
-+
-[NOTE]
-====
-As you can see, simply typing `im_still_here.sh` will not work. You need to run the program with `./im_still_here.sh`. The reason is that, by default, the operating system looks for executables in the locations listed in your `$PATH` environment variable. The directory containing `im_still_here.sh` is not in your `$PATH`, so the program will not be found. In order to make it clear _where_ the program is, you need to run it with `./`.
-====
-+
-. Instead, try and run the program by typing `./im_still_here.sh`.
-+
-[NOTE]
-====
-Uh oh, another warning. This time, you get a warning that says something like "permission denied". In order to execute a program, you need to grant the program execute permissions. To grant execute permissions for your program, run the following command.
-
-[source,bash]
-----
-chmod +x im_still_here.sh
-----
-====
-+
-. Try and run the program by typing `./im_still_here.sh`.
-. The program should begin running, printing out a count every second.
-. Suspend the program by typing kbd:[Ctrl+Z].
-. Run the program again by typing `./im_still_here.sh`, then suspend it again.
-. Run the command, `jobs`, to see the jobs you have running.
-. To continue running a job, use either the `fg` command or `bg` command.
-+
-[TIP]
-====
-`fg` stands for foreground and `bg` stands for background.
-
-`fg %1` will continue to run job 1 in the foreground. During this time you will not have the shell available for you to use. To re-suspend the program, you can press kbd:[Ctrl+Z] again.
-
-`bg %1` will run job 1 in the background. During this time the shell will be available to use. Try running `ls` to demonstrate. Note that the program, although running in the background, will still be printing to your screen. Although annoying, you can still run and use the shell. In this case, however, you will most likely want to stop running this program in the background due to its disruptive behavior. kbd:[Ctrl+Z] will no longer suspend the program, because this program is running in the background, not foreground. To suspend the program, first send it to the foreground with `fg %1`, _then_ use kbd:[Ctrl+Z] to suspend it.
-====
-
-Experiment with moving the jobs between the foreground, background, and suspended states until you feel comfortable with it. It is a handy trick to learn!
-
-[TIP]
-====
-By default, a program is launched in the foreground. To run a program in the background from the start, end the command with a `&`, like in the following example.
-
-[source,bash]
-----
-./im_still_here.sh &
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem. Since you will need to use kbd:[Ctrl+Z], and things of that nature, when what you are doing isn't "code", just describe what you did. For example, if I press kbd:[Ctrl+Z], I would say "I pressed kbd:[Ctrl+Z]".
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project03.adoc
deleted file mode 100644
index 76027c880..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project03.adoc
+++ /dev/null
@@ -1,191 +0,0 @@
-= STAT 29000: Project 3 -- Fall 2021
-
-== Regular expressions, irregularly satisfying, introduction to `grep` and regular expressions
-
-**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process -- after all, projects in industry will not typically provide you with a path to your dataset and call it a day. `grep` is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.
-
-[NOTE]
-====
-Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, or form in all major programming languages. Even if you are less interested in UNIX tools (which you shouldn't be, they can be awesome), you should definitely take the time to learn regular expressions.
-====
-
-**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python.
-
-**Scope:** `grep`, regular expression basics, utilizing regular expression tools in R and Python
-
-.Learning Objectives
-****
-- Use `grep` to search for patterns within a dataset.
-- Use `cut` to section off and slice up data from the command line.
-- Use `wc` to count the number of lines of input.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-`/anvil/projects/tdm/data/consumer_complaints/complaints.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data.
-
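-In its simplest form, `grep` takes a pattern and one or more files, and prints every line that matches. For instance, the following is a generic sketch -- the file and directory are placeholders, and this is _not_ the solution to the question below.
-
-[source,bash]
-----
-# print every line of somefile.txt containing the word "fraud"
-grep 'fraud' somefile.txt
-
-# -r searches a directory recursively; -l prints only the names of files with a match
-grep -rl 'fraud' /path/to/some/directory
-----
-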
-Let's assume for a second that we _didn't_ provide you with the location of this project's dataset, and you didn't know the name of the file either. With all of that being said, you _do_ know that it is the only dataset with the text "That's the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. (When you search for this sentence in the file, make sure that you type the single quote in "That's" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence.)
-
-Write a `grep` command that finds the dataset. You can start in the `/depot/datamine/data` directory to reduce the amount of text being searched. In addition, use a wildcard to reduce the directories we search to only directories that start with a `c` inside the `/depot/datamine/data` directory.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-In the previous project, you learned about a command that could quickly print out the first _n_ lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and _only_ the first line of the file.
-
-Great, now that you know what each column holds, repeat question (1), but format the output so that it shows the `complaint_id`, `consumer_complaint_narrative`, and the `state`.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Imagine a scenario where we are dealing with a _much_ bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in the `complaint_id`, `state`, `consumer_complaint_narrative`, and `tags`.
-
-Use UNIX tools to, in one line, create a _new_ dataset called `southeast.csv` that only contains the data for the five states mentioned above, and only the columns listed above.
-
-[TIP]
-====
-Be careful you don't accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL").
-====
-
-How many rows of data remain? How many megabytes is the new file? Use `cut` to isolate _just_ the data we ask for. For example, _just_ print the number of rows, and _just_ print the value (in Mb) of the size of the file.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-We want to isolate some of our southeast complaints. Return rows from our new dataset, `southeast.csv`, that have one of the following words: "wow", "irritating", or "rude" followed by at least 1 exclamation mark. Do this with just a single `grep` command. Ignore case (match regardless of whether parts of "wow", "rude", or "irritating" are capitalized).
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-If you pay attention to the `consumer_complaint_narrative` column, you'll notice that some of the narratives contain dollar amounts in curly braces `{` and `}`. Use `grep` to find the narratives that contain at least one dollar amount enclosed in curly braces. Use `head` to limit output to only the first 5 results.
-
-[TIP]
-====
-Use the option `--color=auto` to get some nice, colored output (if using a terminal).
-====
-
-[TIP]
-====
-Use the option `-E` to use extended regular expressions. This will make your regular expressions less messy (less escaping).
-====
-
-[NOTE]
-====
-There are instances like `{>= $1000000}` and `{ XXXX }`. The first example qualifies, but the second doesn't. Make sure the following are matched:
-
-- {$0.00}
-- { $1,000.00 }
-- {>= $1000000}
-- { >= $1000000 }
-
-And that the following are _not_ matched:
-
-- { XXX }
-- {XXX}
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-As mentioned earlier on, every major language has some sort of regular expression package. Use either the `re` package in Python (or string methods in `pandas`, for example, `findall`), or the `grep` and `grepl` functions and the `stringr` package in R, to perform the same operation as in question (5).
-
-[TIP]
-====
-If you are using `pandas`, there will be 3 types of results: lists of strings, empty lists, and `NA` values. You can convert your empty lists to `NA` values like this.
-
-[source,python]
-----
-dat['amounts'] = dat['amounts'].apply(lambda x: pd.NA if x==[] else x)
-----
-
-Then, `dat['amounts']` will be a `pandas` Series whose values are either `pd.NA` or lists of strings, which you can filter like this.
-
-[source,python]
-----
-dat['amounts'].loc[dat['amounts'].notna()]
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7 (optional, 0 pts)
-
-As mentioned earlier on, every major language has some sort of regular expression package. Use either the `re` package in Python, or the `grep` and `grepl` functions and the `stringr` package in R, to create a new column in your data frame (`pandas` or R data frame) named `amounts` that contains a semi-colon separated string of dollar amounts _without_ the dollar sign. For example, if the dollar amounts are $100, $200, and $300, the amounts column should contain `100.00;200.00;300.00`.
-
-[TIP]
-====
-One good way to do this is to use the `apply` method on the `pandas` Series.
-
-[source,python]
-----
-dat['amounts'] = dat['amounts'].apply(some_function)
-----
-====
-
-[TIP]
-====
-This is one way to test if a value is `NA` or not.
-
-[source,python]
-----
-isinstance(my_list, type(pd.NA))
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project04.adoc
deleted file mode 100644
index 2e953f88e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project04.adoc
+++ /dev/null
@@ -1,164 +0,0 @@
-= STAT 29000: Project 4 -- Fall 2021
-
-== Extracting and summarizing data in bash
-
-**Motivation:** Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs.
-
-**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
-
-**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping
-
-.Learning Objectives
-****
-- Use `cut` to section off and slice up data from the command line.
-- Use piping to string UNIX commands together.
-- Use `sort` and its options to sort data in different ways.
-- Use `head` to isolate n lines of output.
-- Use `wc` to summarize the number of lines in a file or in output.
-- Use `uniq` to filter out non-unique lines.
-- Use `grep` to search files effectively.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/stackoverflow/unprocessed/*`
-- `/depot/datamine/data/stackoverflow/processed/*`
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-One of the first things to do when first looking at a dataset is reading the first few lines of data in the file. Typically, there will be some headers which describe the data, _and_ you get to see what some of the data looks like. Use the UNIX `head` command to read the first few lines of the data in `unprocessed/2011.csv`.
-
-As you will quickly see, this dataset is just too wide -- there are too many columns -- to be useful. Let's try and count the number of columns using `head`, `tr`, and `wc`. If we can get the first row, replace `,`'s with newlines, then use `wc -l` to count the number of lines, this should work, right? What happens?
-
-[TIP]
-====
-The newline character in UNIX is `\n`.
-====
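-
-To make the described pipeline concrete, here is a sketch on a hypothetical csv file -- part of this question is observing what happens when you apply the same idea to `unprocessed/2011.csv`.
-
-[source,bash]
-----
-# grab the header row, turn each comma into a newline, then count the lines
-head -n 1 somefile.csv | tr ',' '\n' | wc -l
-----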
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in `processed/2011.csv`. How many columns are there?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Let's switch gears, and look at a larger dataset with more data to analyze. Check out `iowa_liquor_sales_cleaner.txt`. What are the 5 largest orders by number of bottles sold?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-What are the different sizes (in ml) that a bottle of liquor comes in?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Which store has the most invoices? There are 2 columns you could potentially use to solve this problem -- which should you use, and why? For this dataset, does it end up making a difference?
-
-[NOTE]
-====
-This may take a few minutes to run. Grab a coffee. To prevent wasting time, try practicing on the `head` of the data instead of the entire dataset.
-====
-
-[IMPORTANT]
-====
-Be _very_ careful when using `uniq`. Read the man pages for `uniq`, otherwise, you may not get the correct solution.
-
-[source,bash]
-----
-man uniq
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-`sort` is a particularly powerful tool, albeit not always the most user-friendly when compared to other tools.
-
-For the largest sale (in USD), what was the volume sold in liters?
-
-For the largest sale (in liters of liquor sold), what was the total cost (in USD)?
-
-[TIP]
-====
-Use the `-k` option with sort to solve these questions.
-====
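-
-If the `-k` option is new to you, here is a generic sketch of how it is often combined with a field separator and numeric, reversed sorting. The file and field number here are placeholders.
-
-[source,bash]
-----
-# sort a comma-delimited file by its 5th field, numerically, largest first, and show the top 5
-sort -t',' -k5,5nr somefile.csv | head -n 5
-----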
-
-[TIP]
-====
-To remove a dollar sign from text using `tr`, do the following.
-
-[source,bash]
-----
-tr -d '$'
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-Use `head`, `grep`, `sort`, `uniq`, `wc`, and any other UNIX utilities you feel comfortable using to answer a data-driven question about the `iowa_liquor_sales_cleaner.txt` dataset.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project05.adoc
deleted file mode 100644
index b2c971096..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project05.adoc
+++ /dev/null
@@ -1,199 +0,0 @@
-= STAT 29000: Project 5 -- Fall 2021
-
-**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation quickly using something like `awk`.
-
-**Context:** This is the first project where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
-
-**Scope:** awk, UNIX utilities
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-While the UNIX tools we've used up to this point are very useful, `awk` enables many new capabilities, and can even replace major functionality of other tools.
-
-In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using `awk`.
-
-Similarly, we've used `head` to print the header line. Use `awk` to do the same.
-
-Similarly, we've used `wc` to count the number of lines in the dataset. Use `awk` to do the same.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-In a previous question, we used `sort` in combination with `uniq` to find the stores with the most number of sales.
-
-Use `awk` to find the 10 stores with the highest number of sales. In a previous solution, our output was minimal -- we had a count and a store number. This time, take some time to format the output nicely, _and_ use the store number to find the count (not store name).
-
-[TIP]
-====
-Sorting an array by values in `awk` can be confusing. Check out https://stackoverflow.com/questions/5342782/sort-associative-array-with-awk[this excellent stackoverflow post] to see a couple of ways to do this. "Edit 2" is the easiest one to follow.
-====
-
-[NOTE]
-====
-You can even use the store number as the key to count the number of sales, and save the most recent store name seen for each store number as you go, so that you can _print_ the store names with the output.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Calculate the total sales (in USD) by county. Do this using any UNIX commands you have available. Then, do this using _only_ `awk`.
-
-[TIP]
-====
-`gsub` is a powerful awk function that allows you to replace all matches of a pattern with another string. For example, you could replace all `$`'s in field 2 with nothing by:
-
-[source,awk]
-----
-gsub(/\$/, "", $2)
-----
-====
-
-[NOTE]
-====
-The `gsub` operation happens in-place. In a nutshell, what this means is that the original field, `$2`, is replaced with the result of the `gsub` operation.
-====
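-
-Put together, a `gsub` call inside a complete `awk` one-liner might look something like the following sketch. The file, delimiter, and field numbers here are placeholders, not the answer for this dataset.
-
-[source,bash]
-----
-# strip dollar signs from field 3, then total field 3 grouped by field 2
-awk -F';' '{gsub(/\$/, "", $3); totals[$2] += $3} END {for (k in totals) print k, totals[k]}' somefile.txt
-----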
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Use `awk` and piping to create a new dataset with the following columns, for every store, by month:
-
-- `month_number`: the month number (01-12)
-- `year`: the year (4-digit year, e.g., 2015)
-- `store_name`: store name
-- `volume_sold`: total volume sold
-- `sold_usd`: total amount sold in USD
-
-Call the new dataset `sales_by_store.csv`.
-
-[TIP]
-====
-Feel free to use the store name as a key for simplicity.
-====
-
-[TIP]
-====
-`split` is another powerful function in `awk` that allows you to split a string into multiple fields. You could, for example, extract the year from the date field as follows.
-
-[source,awk]
-----
-split($2, dates, "/", seps);
-----
-
-Then, you can access the year using `dates[3]`.
-====
-
-[TIP]
-====
-You can use multiple values as a key in `awk`. This is a cool trick to count or calculate something by year, for example.
-
-[source,awk]
-----
-myarray[$4dates[3]]++
-----
-
-Here, `$4` is the 4th field, `dates[3]` is the year. The resulting key would be something like "My Store Name2014", and we would have a new key (and associated value) for each store/year combination. In the provided code (below), Dr Ward suggests the use of a triple key, which includes the store name, the month, and the year.
-====
-
-[TIP]
-====
-Dr Ward walks you through a method of solution for this problem in the video.
-
-[source,bash]
-----
-cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt |
- awk -F\; 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" }
- {gsub(/\$/, "", $22); split($2, dates, "/", seps);
- mysales[$4";"dates[1]";"dates[3]] += $22;
- myvolumes[$4";"dates[1]";"dates[3]] += $24;
- }
- END{ for (mytriple in mysales) {print mytriple";"mysales[mytriple]";"myvolumes[mytriple]}}' >sales_by_store.csv
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Use `awk` to count how many times each store has sold more than $500,000 in a month. Output should be similar to the following. Sort the output from highest count to lowest.
-
-----
-store_name,count
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project06.adoc
deleted file mode 100644
index ac4a5677d..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project06.adoc
+++ /dev/null
@@ -1,641 +0,0 @@
-= STAT 29000: Project 6 -- Fall 2021
-
-== The anatomy of a bash script
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming; however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, bash scripts, UNIX utilities
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-- Write bash scripts to automate potential repeated tasks.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/election/*`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "note" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on.
-====
-
-We now have a grip on a variety of useful tools that are often used together using pipes and redirections. As you can see, "one-liners" can start to become a bit unwieldy. In these cases, wrapping everything into a bash script can be a good solution.
-
-Imagine for a minute, that you have a single file that is continually appended to by another system. Let's say this file is `/depot/datamine/data/election/itcont1990.txt`. Every so often, your manager asks you to generate a summary of the data in this file. Every time you do this, you have to dig through old notes to remember how you did this previously. Instead of constantly doing this manual process, you decide to write a script to handle this for you!
-
-Write a bash script to generate a summary of the data in `/depot/datamine/data/election/itcont1990.txt`. The summary should include the following information, in the following format.
-
-....
-120 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont1990.txt
-Largest donor:
-Most common donor state: NY
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-....
-
-[NOTE]
-====
-For this question, assume that the data file will _always_ be in the same location.
-====
-
-[source,bash]
-----
-#!/bin/bash
-
-FILE=/depot/datamine/data/election/itcont1990.txt
-
-RECORDS_READ=`wc -l $FILE | awk '{print $1}'`
-
-awk -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{
- print RECORDS_READ" RECORDS READ\n----------------";
-}{
- donor_total_by_name[$8] += $15;
- most_common_donor_by_state[$10]++;
- donor_total_by_state[$10] += $15;
-}END{
-    # gawk-specific: makes "for (i in array)" iterate in descending order of the array values
-    PROCINFO["sorted_in"] = "@val_num_desc";
- print "File: "FILENAME;
-
- ct=0;
-
- for (i in donor_total_by_name) {
- if (ct < 1) {
- print "Largest donor: " i;
- ct++;
- }
- };
-
- ct=0;
-
- for (i in most_common_donor_by_state) {
- if (ct < 1) {
- print "Most common donor state: " i;
- ct++;
- }
- }
-
- print "Total donations in USD by state:";
-
- for (i in donor_total_by_state) {
- if (i != "STATE" && i != "") {
- print "\t- " i ": " donor_total_by_state[i];
- }
- }
-
- print "----------------";
-
-}' "$FILE"
-----
-
-In order to run this script, you will need to paste the contents into a new file called `firstname-lastname-q1.sh` in your `$HOME` directory. In a new bash cell, run it as follows.
-
-[source,ipython]
-----
-%%bash
-
-chmod +x $HOME/firstname-lastname-q1.sh
-$HOME/firstname-lastname-q1.sh
-----
-
-That `chmod` command is necessary to ensure that you can execute the script.
-
-Create the script and run the script in a bash cell.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-++++
-
-++++
-
-Your manager loves your script, but wants you to modify it so it works with any file formatted the same way. A new system is being installed that saves new data into new files rather than appending to the same file.
-
-Modify the script from question (1) to accept an argument that specifies the file to process.
-
-Start by copying the old script from question (1) into a new file called `firstname-lastname-q2.sh`.
-
-[source,ipython]
-----
-%%bash
-
-cp $HOME/firstname-lastname-q1.sh $HOME/firstname-lastname-q2.sh
-----
-
-Then, test the updated script out on `/depot/datamine/data/election/itcont2000.txt`.
-
-[source,ipython]
-----
-%%bash
-
-$HOME/firstname-lastname-q2.sh /depot/datamine/data/election/itcont2000.txt
-----
-
-[TIP]
-====
-You can edit your scripts directly within Jupyter Lab by right clicking the files and opening them in the editor.
-====
-
-[TIP]
-====
-The only difference between the two scripts is that, in the new script, instead of hard-coding the path in `$FILE`, you set `$FILE` from the first argument passed to the script, as sketched below.
-====
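-
-In other words, the top of the new script might look something like this sketch -- only the way `FILE` is set changes, and the rest of the script stays the same.
-
-[source,bash]
-----
-#!/bin/bash
-
-# use the first argument passed to the script instead of a hard-coded path
-FILE=$1
-----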
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-Modify your script once again to accept _n_ arguments, each a path to another file to generate a summary for.
-
-Start by copying the old script from question (2) into a new file called `firstname-lastname-q3.sh`.
-
-[source,ipython]
-----
-%%bash
-
-cp $HOME/firstname-lastname-q2.sh $HOME/firstname-lastname-q3.sh
-----
-
-You should be able to run the script as follows.
-
-[source,ipython]
-----
-%%bash
-
-$HOME/firstname-lastname-q3.sh /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-----
-
-....
-155 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont2000.txt
-Largest donor:
-Most common donor state: NY
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-
-120 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont1990.txt
-Largest donor:
-Most common donor state: NY
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-....
-
-[TIP]
-====
-Again, the modifications that need to be made here aren't so bad at all! If you just wrap the entirety of question (2)'s solution in a for loop where you loop through each argument, you'll just need to make sure you change the `$FILE` argument to the `wc` command to be the argument you are setting in each iteration of the loop, as sketched below.
-====
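-
-A skeleton of that structure might look like the following sketch -- the summary logic from question (2) goes where the comment is.
-
-[source,bash]
-----
-#!/bin/bash
-
-# loop over every path passed to the script
-for FILE in "$@"
-do
-    # ... generate the summary for "$FILE" here, exactly as in question (2) ...
-    echo "Processing $FILE"
-done
-----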
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "tip" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on, and run the example we provide.
-====
-
-You are _particularly_ interested in donors from your alma mater, https://purdue.edu[Purdue University]. Modify your script from question (3) yet again. This time, add a flag that, when present, will include the name and amount for each donor where the word "purdue" (case insensitive) is present in the `EMPLOYER` column.
-
-[source,ipython]
-----
-%%bash
-
-$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-----
-
-....
-155 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont2000.txt
-Largest donor: ASARO, SALVATORE
-Most common donor state: NY
-Purdue donors:
-- John Smith: 500
-- Alice Bob: 1000
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-
-120 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont1990.txt
-Largest donor: ASARO, SALVATORE
-Most common donor state: NY
-Purdue donors:
-- John Smith: 500
-- Alice Bob: 1000
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-....
-
-[TIP]
-====
-https://stackoverflow.com/a/29754866[This] stackoverflow response has an excellent template using `getopt` to parse your flags. Use this as a "start".
-====
-
-[TIP]
-====
-You may want to comment out or delete the part of the template that limits your non-flag arguments to one.
-====
-
-[source,bash]
-----
-#!/bin/bash
-
-# More safety, by turning some bugs into errors.
-# Without `errexit` you don’t need ! and can replace
-# PIPESTATUS with a simple $?, but I don’t do that.
-set -o errexit -o pipefail -o noclobber -o nounset
-
-# -allow a command to fail with !’s side effect on errexit
-# -use return value from ${PIPESTATUS[0]}, because ! hosed $?
-! getopt --test > /dev/null
-if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
- echo 'I’m sorry, `getopt --test` failed in this environment.'
- exit 1
-fi
-
-OPTIONS=p
-LONGOPTS=purdue
-
-# -regarding ! and PIPESTATUS see above
-# -temporarily store output to be able to check for errors
-# -activate quoting/enhanced mode (e.g. by writing out “--options”)
-# -pass arguments only via -- "$@" to separate them correctly
-! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
-if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
- # e.g. return value is 1
- # then getopt has complained about wrong arguments to stdout
- exit 2
-fi
-# read getopt’s output this way to handle the quoting right:
-eval set -- "$PARSED"
-
-p=n
-# now enjoy the options in order and nicely split until we see --
-while true; do
- case "$1" in
- -p|--purdue)
- p=y
- shift
- ;;
- --)
- shift
- break
- ;;
- *)
- echo "Programming error"
- exit 3
- ;;
- esac
-done
-
-# handle non-option arguments
-# if [[ $# -ne 1 ]]; then
-# echo "$0: A single input file is required."
-# exit 4
-# fi
-
-for file in "$@"
-do
- RECORDS_READ=`wc -l $file | awk '{print $1}'`
-
- awk -v PFLAG="$p" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{
- print RECORDS_READ" RECORDS READ\n----------------";
- }{
-
- if ($8 != "") {
- donor_total_by_name[$8] += $15;
- }
- most_common_donor_by_state[$10]++;
- donor_total_by_state[$10] += $15;
-
- # see if "purdue" appears in line
- if (PFLAG == "y") {
- has_purdue = match(tolower($0), /purdue/)
- if (has_purdue != 0) {
- purdue_total_by_name[$8] += $15;
- }
- }
-
- }END{
- PROCINFO["sorted_in"] = "@val_num_desc";
- print "File: "FILENAME;
-
- ct=0;
-
- for (i in donor_total_by_name) {
- if (ct < 1) {
- print "Largest donor: " i;
- ct++;
- }
- };
-
- ct=0;
-
- for (i in most_common_donor_by_state) {
- if (ct < 1) {
- print "Most common donor state: " i;
- ct++;
- }
- }
-
- if (PFLAG == "y") {
- print "Purdue donors:";
- for (i in purdue_total_by_name) {
- print "\t- " i ": " purdue_total_by_name[i];
- }
- }
-
- print "Total donations in USD by state:";
-
- for (i in donor_total_by_state) {
- if (i != "STATE" && i != "") {
- print "\t- " i ": " donor_total_by_state[i];
- }
- }
-
- print "----------------\n";
-
- }' $file
-done
-----
-
-Please copy and paste this code into a new script called `firstname-lastname-q4.sh` and run it.
-
-[source,ipython]
-----
-%%bash
-
-$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-[IMPORTANT]
-====
-Originally, this project was a bit more involved than intended. Instead of writing this script from scratch, I would like you to fill in the parts of the script marked with the text FIXME, and then test the script with the commands provided.
-====
-
-Your manager liked that new feature, however, she thinks the tool would be better suited to search the `EMPLOYER` column for a specific string, and then handle this generically, rather than just handling the specific case of Purdue.
-
-Modify your script from question (4). Accept one and only one flag, `-e` or `--employer`. This flag should take a string as an argument and search the `EMPLOYER` column for that string. The script should then print out the results, including only the top 5 donors matching the employer string. The following is an example if we chose to search for "ford".
-
-[source,bash]
-----
-$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-----
-
-....
-155 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont2000.txt
-Largest donor: ASARO, SALVATORE
-Most common donor state: NY
-ford donors:
-- John Smith: 500
-- Alice Bob: 1000
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-
-120 RECORDS READ
-----------------
-File: /depot/datamine/data/election/itcont1990.txt
-Largest donor: ASARO, SALVATORE
-Most common donor state: NY
-ford donors:
-- John Smith: 500
-- Alice Bob: 1000
-Total donations in USD by state:
-- NY: 100000
-- CA: 50000
-...
-----------------
-....
-
-[source,bash]
-----
-#!/bin/bash
-
-# More safety, by turning some bugs into errors.
-# Without `errexit` you don’t need ! and can replace
-# PIPESTATUS with a simple $?, but I don’t do that.
-set -o errexit -o pipefail -o noclobber -o nounset
-
-# -allow a command to fail with !’s side effect on errexit
-# -use return value from ${PIPESTATUS[0]}, because ! hosed $?
-! getopt --test > /dev/null
-if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
- echo 'I’m sorry, `getopt --test` failed in this environment.'
- exit 1
-fi
-
-OPTIONS=e:
-LONGOPTS=employer:
-
-# -regarding ! and PIPESTATUS see above
-# -temporarily store output to be able to check for errors
-# -activate quoting/enhanced mode (e.g. by writing out “--options”)
-# -pass arguments only via -- "$@" to separate them correctly
-! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
-if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
- # e.g. return value is 1
- # then getopt has complained about wrong arguments to stdout
- exit 2
-fi
-# read getopt’s output this way to handle the quoting right:
-eval set -- "$PARSED"
-
-e=-
-# now enjoy the options in order and nicely split until we see --
-while true; do
- case "$1" in
- -e|--employer)
- e="$2"
- shift 2
- ;;
- --)
- shift
- break
- ;;
- *)
- echo "Programming error"
- exit 3
- ;;
- esac
-done
-
-# handle non-option arguments
-# if [[ $# -ne 1 ]]; then
-# echo "$0: A single input file is required."
-# exit 4
-# fi
-
-for file in "$@"
-do
- RECORDS_READ=`wc -l $file | awk '{print $1}'`
-
- awk -v EFLAG="$FIXME" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{ <1>
- print RECORDS_READ" RECORDS READ\n----------------";
- }
- {
-
- if ($8 != "") {
- donor_total_by_name[$8] += $15;
- }
- most_common_donor_by_state[$10]++;
- donor_total_by_state[$10] += $15;
-
- # see if search string appears in line
- if (EFLAG != "") {
- has_string = match(tolower($12), EFLAG)
- if (has_string != 0) {
- employer_total_by_name[$8] += $15;
- }
- }
-
- }END{
- PROCINFO["sorted_in"] = "@val_num_desc";
- print "File: "FILENAME;
-
- ct=0;
-
- for (i in donor_total_by_name) {
- if (ct < 1) {
- print "Largest donor: " i;
- ct++;
- }
- };
-
- ct=0;
-
- for (i in most_common_donor_by_state) {
- if (ct < 1) {
- print "Most common donor state: " i;
- ct++;
- }
- }
-
- ct=0;
-
- if (EFLAG != "") {
- print EFLAG" donors:";
- for (i in FIXME) { <2>
- if (ct < 5) {
- print "\t- " i ": " FIXME[i]; <3>
- FIXME; <4>
- }
- }
- }
-
- print "Total donations in USD by state:";
-
- for (i in donor_total_by_state) {
- if (i != "STATE" && i != "") {
- print "\t- " i ": " donor_total_by_state[i];
- }
- }
-
- print "----------------\n";
-
- }' $file
-done
-----
-
-<1> We should put "$something" here -- check out how we handled this in question (4) and look at the changes in question (5) to help isolate what goes here.
-<2> What are we looping through here? All you need to do is change it to the only remaining `awk` array we haven't looped through in the rest of the code.
-<3> Now we want to access the _value_ of the array -- it would make sense if it were the same array as the previous FIXME, right?!
-<4> Without this code, we will print ALL of the donors -- not just the first 5.
-
-Then test it out!
-
-[source,ipython]
-----
-%%bash
-
-$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project07.adoc
deleted file mode 100644
index f6a50bf4c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project07.adoc
+++ /dev/null
@@ -1,341 +0,0 @@
-= STAT 29000: Project 7 -- Fall 2021
-:page-mathjax: true
-
-== Bashing out liquor sales data
-
-**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
-
-**Context:** This is the second project in a series of projects focused on bash _and_ `awk`. Here, we take a deeper dive and create some more complicated awk scripts, as well as utilize the bash skills learned in previous projects.
-
-**Scope:** bash, `awk`, bash scripts, R, Python
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-- Write bash scripts to automate potential repeated tasks.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful.
-
-The data will look something like the following:
-
-----
-1013 MAINKEOKUK 52632(40.39978, -91.387531)
-----
-
-What this means is that you can't just parse out the latitude and longitude coordinates and call it a day -- you need to use `awk` functions like `gsub` and `split` to extract the latitude and longitude coordinates.
-
-Use `awk` to print out the latitude and longitude for each line in the original dataset. Output should resemble the following.
-
-----
-lat,lon
-1.23,4.56
-----
-
-[NOTE]
-====
-Make sure to take care of rows that don't have latitude and longitude coordinates -- just skip them. So if your results look like this, you need to add logic to skip the "empty" rows:
-
-----
-40.39978, -91.387531
-40.739238, -95.02756
-40.624226, -91.373211
-,
-41.985887, -92.579244
-----
-
-To do this, just go ahead and wrap your print in an if statement similar to:
-
-[source,awk]
-----
-if (length(coords[1]) > 0) {
-    print coords[1]","coords[2]
-}
-----
-====
-
-[TIP]
-====
-`split` and `gsub` will be useful `awk` functions to use for this question.
-====
-
-[TIP]
-====
-If we have a bunch of data formatted like the following:
-
-----
-1013 MAINKEOKUK 52632(40.39978, -91.387531)
-----
-
-If we first used `split` to split on "(", for example like:
-
-[source,awk]
-----
-split($8, coords, "(", seps);
-----
-
-`coords[2]` would be:
-
-----
-40.39978, -91.387531)
-----
-
-Then, you could use `gsub` to remove any ")" characters from `coords[2]` like:
-
-[source,awk]
-----
-gsub(/\)/, "", coords[2]);
-----
-
-`coords[2]` would be:
-
-----
-40.39978, -91.387531
-----
-
-At this point I'm sure you can see how to use `awk` to extract and print the rest!
-====
-
-[IMPORTANT]
-====
-Don't forget any lingering space after the first comma! We don't want that.
-====
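-
-For example, a `gsub` like the following sketch would strip a leading space (here `latlon` is a hypothetical array name holding the two pieces after splitting on the comma):
-
-[source,awk]
-----
-# split "40.39978, -91.387531" into lat/lon, then strip the leading space from lon
-split(coords[2], latlon, ",");
-gsub(/^[ ]+/, "", latlon[2]);
-----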
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Redo question (4) (and reproduce `sales_by_store.csv`) from project (5), but this time add 2 additional columns to the dataset -- `lat` and `lon`.
-
-- 'lat': latitude
-- 'lon': longitude
-
-Before you panic (this was a tough question), we've provided the solution to the original question (without the new columns) below as a starting point for you.
-
-[source,ipynb]
-----
-%%bash
-
-awk -F';' 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" }
- {
- gsub(/\$/, "", $22); split($2, dates, "/", seps);
- mysales[$4";"dates[1]";"dates[3]] += $22;
- myvolumes[$4";"dates[1]";"dates[3]] += $24;
- }
- END{
- for (mytriple in mysales)
- {
- print mytriple";"mysales[mytriple]";"myvolumes[mytriple]
- }
- }' /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt > sales_by_store.csv
-----
-
-[CAUTION]
-====
-It may take a few minutes to run this script. Grab a coffee, tea, or something else to keep you going.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Believe it or not, `awk` even supports trigonometric calculations like `sin` and `cos`. Write a bash script that, given a pair of latitudes and a pair of longitudes, calculates the distance between the two points.
-
-Okay, so how to get started? To calculate this, we can use https://en.wikipedia.org/wiki/Haversine_formula[the Haversine formula]. The formula is:
-
-$2r\arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$
-
-Where:
-
-- $r$ is the radius of the Earth in kilometers; we can use 6367.4447 kilometers
-- $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
-- $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points
-
-In `awk`, `sin` is `sin`, `cos` is `cos`, and `sqrt` is `sqrt`.
-
-To get the `arcsin` use the following `awk` function:
-
-[source,awk]
-----
-function arcsin(x) { return atan2(x, sqrt(1-x*x)) }
-----
-
-To convert from degrees to radians, use the following `awk` function:
-
-[source,awk]
-----
-function dtor(x) { return x*atan2(0, -1)/180 }
-----
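-
-To see how these pieces fit together, here is a minimal standalone sketch (an illustration, not necessarily the graded solution; the radius and test coordinates come from this question):
-
-[source,bash]
-----
-awk 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }
-function dtor(x) { return x*atan2(0, -1)/180 }
-function haversine(lat1, lon1, lat2, lon2,    dphi, dlam, a) {
-    # half-differences of the angles, converted to radians
-    dphi = dtor(lat2 - lat1) / 2;
-    dlam = dtor(lon2 - lon1) / 2;
-    a = sin(dphi)^2 + cos(dtor(lat1)) * cos(dtor(lat2)) * sin(dlam)^2;
-    return 2 * 6367.4447 * arcsin(sqrt(a));
-}
-BEGIN {
-    # the test pair from this question; should print roughly 309.57
-    printf "%.2f\n", haversine(40.39978, -91.387531, 40.739238, -95.02756);
-}'
-----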
-
-The following is how the script should work (with a real example you can test):
-
-[source,bash]
-----
-./question3.sh 40.39978 -91.387531 40.739238 -95.02756
-----
-
-.Results
-----
-309.57
-----
-
-[TIP]
-====
-To include functions in your `awk` command, do as follows:
-
-[source,bash]
-----
-awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
- lat1 = dtor(lat1);
- print lat1;
- # rest of your code here!
-}'
-----
-====
-
-[TIP]
-====
-We want you to create a bash script called `question3.sh`. After you have your bash script, we want you to run it in a bash cell to see the output.
-
-The following is some skeleton code that you can use to get started.
-
-[source,bash]
-----
-#!/bin/bash
-
-lat1=$1
-lat2=$3
-lon1=$2
-lon2=$4
-
-awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
- lat1 = dtor(lat1);
- print lat1;
- # rest of your code here!
-}'
-----
-====
-
-[TIP]
-====
-You may need to give your script execute permissions like this.
-
-[source,bash]
-----
-chmod +x /path/to/question3.sh
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Create a new bash script called `question4.sh` that accepts a latitude, longitude, filename, and n.
-
-The latitude and longitude are a point that we want to calculate the distance from.
-
-The filename is `sales_by_store.csv` -- our resulting dataset from question (2).
-
-Finally, n is the number of stores from our `sales_by_store.csv` file for which we want to calculate the distance from the provided latitude and longitude.
-
-[source, bash]
-----
-./question4.sh 40.39978 -91.387531 sales_by_store.csv 3
-----
-
-.Output
-----
-Distance from (40.39978,-91.387531)
-store_name,distance
-The Music Station,253.915
-KUM & GO #4 / LAMONI,213.455
-KUM & GO #4 / LAMONI,213.447
-----
-
-To get you started, you can use the following "starter" code. Fix the code to work:
-
-[source,bash]
-----
-#!/bin/bash
-
-lat_from=$1
-lon_from=$2
-file=$3
-n=$4
-
-awk -F';' -v n=$n -v lat_from=$lat_from -v lon_from=$lon_from 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }function distance(lat1, lon1, lat2, lon2) {
-    # question 3 code here <1>
- return dist;
-}BEGIN {
- print "Distance from ("lat_from","lon_from")"
- print "store_name,distance";
-} NR>1 && NR <= n+1 {
- lat2 = FIXME; <2>
- lon2 = FIXME; <3>
- dist = distance(lat_from, lon_from, FIXME, FIXME); <4>
- print $1","dist
-}' $file
-----
-
-<1> Add your distance-calculation code from question (3) here and make sure your distance is stored in a variable called `dist` (which we return).
-<2> Which value goes here?
-<3> Which value goes here?
-<4> Which values go here?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5 (optional, 0 pts)
-
-Use your choice of Python or R, with our `sales_by_store.csv` to create a beautiful graphic mapping the latitudes and longitudes of the stores. If you want to, get creative and increase the size of the points on the map based on the number of sales. You could create a graphic for each month to see how sales change month-to-month. The options are limitless, get creative!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project08.adoc
deleted file mode 100644
index 4bd86e052..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project08.adoc
+++ /dev/null
@@ -1,262 +0,0 @@
-= STAT 29000: Project 8 -- Fall 2021
-
-**Motivation:**
-
-**Context:** This is the third and final part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
-
-**Scope:** awk, bash scripts, R, Python
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-- Write bash scripts to automate potential repeated tasks.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/taxi/*`
-
-== Questions
-
-[NOTE]
-====
-This is the _last_ project based on bash and awk -- the rest are SQL. If you struggled or did not like the bash projects, you are not alone! This is frequently the most intimidating for students. Students tend to really like the SQL projects, so relief is soon to come.
-====
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-Take some time to explore `/depot/datamine/data/taxi/**`, and answer the following questions using UNIX utilities.
-
-- In which two directories is the bulk of the data (except `fhv` -- we don't care about that data for now)?
-- What is the total size in GB of the data in those two directories?
-
-[NOTE]
-====
-So, for example, do all the files in `dir1` have the same number of columns in every row? Do the files in `dir2` have the same number of columns in every row?
-====
-
-[TIP]
-====
-Check out the PDFs in the directory to learn more about the dataset.
-====
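-
-If it helps, a starting point for this kind of exploration might look like the following sketch (the paths come from the dataset listing above):
-
-[source,ipython]
-----
-%%bash
-
-# list the subdirectories, then report the size of each one, human readable
-ls /depot/datamine/data/taxi
-du -h --max-depth=1 /depot/datamine/data/taxi
-----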
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-To start, let's focus on the yellow taxi data. The `Total_Amt` column is the total cost of the taxi ride. It is broken down into 4 categories: `Fare_Amt`, `surcharge`, `Tip_Amt`, and `Tolls_Amt`.
-
-Write a bash script called `question2.sh` that accepts a path to a yellow taxi data file as an argument, and returns a breakdown of the overall percentage each of the 4 categories make up of the total.
-
-.Example output
-----
-fares: 5.0%
-surcharges: 2.5%
-tips: 2.5%
-tolls: 90.0%
-----
-
-To help get you started, here is some skeleton code.
-
-[source,bash]
-----
-#!/bin/bash
-
-awk -F',' '{
- # calculate stuff
- fares+=$13;
-} END {
- # print stuff
-}' $1
-----
-
-[IMPORTANT]
-====
-Make sure your output format matches this example exactly. Every value should be printed with 1 decimal place, followed by a percent sign.
-====
-
-[CAUTION]
-====
-It may take a minute to run. You are processing 2.5G of data!
-====
-
-[TIP]
-====
-https://unix.stackexchange.com/questions/383378/awk-with-one-decimal-place[This] link may be useful.
-====
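-
-In practice, that formatting is just an `awk` `printf` format string, roughly like this sketch (the variable names are placeholders for whatever you accumulate):
-
-[source,awk]
-----
-# one decimal place, followed by a literal percent sign
-printf "fares: %.1f%%\n", 100 * fares / (fares + surcharges + tips + tolls)
-----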
-
-[TIP]
-====
-The result of the following.
-
-[source,ipynb]
-----
-%%bash
-
-chmod +x ./question2.sh
-./question2.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv
-----
-
-Should be:
-
-----
-fares: 92.6%
-surcharges: 1.7%
-tips: 4.5%
-tolls: 1.1%
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Did you know `awk` has the ability to process multiple files at once? Pass multiple files to your script from question (2) to test it out.
-
-[source,bash]
-----
-%%bash
-
-chmod +x ./question3.sh
-./question3.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-02.csv
-----
-
-Now, modify your script from question (2). Return the summary values from question (2), but for each month instead of for the overall data. Use `Trip_Pickup_dateTime` to determine the month.
-
-.Example output
-....
-January
-----
-fares: 5.0%
-surcharges: 2.5%
-tips: 2.5%
-tolls: 90.0%
-----
-
-February
-----
-fares: 5.0%
-surcharges: 2.5%
-tips: 2.5%
-tolls: 90.0%
-----
-
-etc..
-....
-
-[IMPORTANT]
-====
-You will need to pass more than 1 file to your script in order to get more than 1 month of output.
-====
-
-To help get you started, you can find some skeleton code below.
-
-[source,bash]
-----
-#!/bin/bash
-
-awk -F',' 'BEGIN{
- months[1] = "January"
- months[2] = "February"
- months[3] = "March"
- months[4] = "April"
- months[5] = "May"
- months[6] = "June"
- months[7] = "July"
- months[8] = "August"
- months[9] = "September"
- months[10] = "October"
- months[11] = "November"
- months[12] = "December"
-} NR > 1 {
- # use split to parse out the month
-
- # convert the month to int
- month = int();
-
- # sum values by month using awk array
-
-} END {
- for (m in total) {
- if (m != 0) {
- # print stuff
- }
- }
-}' $@
-----
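-
-One way to fill in the month parsing is sketched below. It assumes the pickup timestamp is the second field and looks like `2009-01-04 02:52:00` -- double check the column position and format in your files before relying on it.
-
-[source,awk]
-----
-# split "2009-01-04 02:52:00" on "-" so that dt[2] holds the month, e.g. "01"
-split($2, dt, "-");
-
-# convert the month to int
-month = int(dt[2]);
-----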
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-Pick 1 of the 2 following questions to answer. If you would like to answer both, your instructors and graders will be wow'd and happy (no pressure)!
-
-To be clear, however, you only need to answer 1 of the following 2 questions in order to get full credit.
-====
-
-=== Question 4 (Option 1)
-
-There are a lot of interesting questions that you could ask for this dataset. Here are some questions that could be interesting:
-
-- Does time of day, day of week, or month of year appear to have an effect on tips?
-- Are people indeed more generous (with tips) near Christmas?
-- How many trips are there, by hour of day? What are the rush hours?
-- Do different vendors charge more or less than other vendors?
-
-Either choose a provided question, or write your own. Use your newfound knowledge of UNIX utilities and bash scripts to answer the question. Include the question you want answered, what hypotheses (if any) you have, what the data told you, and what you conclude (anecdotally).
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4 (Option 2)
-
-Standard UNIX utilities are not the end-all be-all of terminal tools. https://github.com/ibraheemdev/modern-unix[This repository] has a lot of really useful tools that tend to have an opinionated take on a classic UNIX tool.
-
-https://github.com/BurntSushi/ripgrep[ripgrep] is the poster child of this new generation of tools. It is a text search utility that is empirically superior to `grep` in the majority of metrics. Additionally, it has subjectively better defaults. You can read (in _great_ detail) about ripgrep https://blog.burntsushi.net/ripgrep/[here].
-
-In addition to those tools, there is https://github.com/BurntSushi/xsv[xsv from the same developer as ripgrep]. `xsv` is a utility designed to perform operations on delimited separated value files. Many of the questions that have been asked about in the previous few projects could have been quickly and easily answered using `xsv`.
-
-Most of these utilities are available to you in a `bash` cell in Jupyter Lab. Choose 2 questions from previous projects and re-answer them using these modern tools. Which did you prefer, and why?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project09.adoc
deleted file mode 100644
index 8d35c866c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project09.adoc
+++ /dev/null
@@ -1,318 +0,0 @@
-= STAT 29000: Project 9 -- Fall 2021
-
-**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://cloudflare.com[Cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is well worth your time!
-
-**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.
-
-**Scope:** SQL, sqlite
-
-.Learning Objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook.
-
-[source,ipython]
-----
-%load_ext sql
-%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db
-----
-
-The first command loads the sql extension. The second command connects to the database.
-
-For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells.
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Get started by taking a look at the available tables in the database. What tables are available?
-
-[TIP]
-====
-You'll want to prepend `%%sql` to the top of the cell -- it should be the very first line of the cell (no comments or _anything_ else before it).
-
-[source,ipython]
-----
-%%sql
-
--- Query here
-----
-====
-
-[TIP]
-====
-In the `sqlite3` command line shell, you can show the tables using the following command:
-
-[source, sql]
-----
-.tables
-----
-
-Unfortunately, sqlite-specific dot-commands like that can't be run in a Jupyter Lab cell. Instead, we need to use a different query.
-
-[source, sql]
-----
-SELECT tbl_name FROM sqlite_master where type='table';
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-It's always a good idea to get an idea of what your table(s) look like. A good way to do this is to get the first 5 rows of data from each table. Write and run 6 queries that return the first 5 rows of data of each table.
-
-To get a better idea of the size of the data, you can use the `count` clause to get the number of rows in each table. Write and run 6 queries that return the number of rows in each table.
-
-[TIP]
-====
-Run each query in a separate cell, and remember to limit the query to return only 5 rows each.
-
-You can use the `limit` clause to limit the number of rows returned.
-====
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-the-first-5-rows-of-the-employees-table[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-This dataset contains movie data from https://imdb.com (an Amazon company). As you can probably guess, it would be difficult to load the data from those tables into a nice, neat dataframe -- it would just take too much memory on most systems!
-
-Okay, let's dig into the `titles` table a little bit. Run the following query.
-
-[source, sql]
-----
-SELECT * FROM titles LIMIT 5;
-----
-
-As you can see, every row has a `title_id` for the associated title of a movie or tv show (or other). What is this `title_id`? Check out the following link:
-
-https://www.imdb.com/title/tt0903747/
-
-At this point, you may suspect that it is the id imdb uses to identify a movie or tv show. Well, let's see if that is true. Query our database to get any matching titles from the `titles` table matching the `title_id` provided in the link above.
-
-[TIP]
-====
-The `where` clause can be used to filter the results of a query.
-====
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-That is pretty cool! Not only do you understand what the `title_id` means _inside_ the database -- but now you know that you can associate a web page with each `title_id` -- for example, if you run the following query, you will get a `title_id` for a "short" called "Carmencita".
-
-[source, sql]
-----
-SELECT * FROM titles LIMIT 5;
-----
-
-.Output
-----
-title_id, type, ...
-tt0000001, short, ...
-----
-
-If you navigate to https://www.imdb.com/title/tt0000001/, sure enough, you'll see a neatly formatted page with data about the movie!
-
-Okay great. Now, if you take a look at the `episodes` table, you'll see that there are both an `episode_title_id` and `show_title_id` associated with each row.
-
-Let's try and make sense of this the same way we did before. Write a query using the `where` clause to find all rows in the `episodes` table where `episode_title_id` is `tt0903747`. What did you get?
-
-Now, write a query using the `where` clause to find all rows in the `episodes` table where `show_title_id` is `tt0903747`. What did you get?
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Very interesting! It looks like we didn't get any results when we queried for `episode_title_id` with an id of `tt0903747`, but we did for `show_title_id`. This must mean these ids can represent both a _show_ and an _episode_ of a show. By that logic, we should be able to find the _title_ of one of the Breaking Bad episodes in the same way we found the title of the show itself, right?
-
-Okay, take a look at the results of your second query from question (4). Choose one of the `episode_title_id` values, and query the `titles` table to find the title of that episode.
-
-Finally, in a browser, verify that the title of the episode is correct. To verify this, take the `episode_title_id` and plug it into the following link.
-
-https://www.imdb.com/title//
-
-So, I used `tt1232248` for my query. I would check to make sure it matches this.
-
-https://www.imdb.com/title/tt1232248/
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-++++
-
-++++
-
-Okay, you should have now established that every _row_ in the `titles` table corresponds to the title of a single episode of a tv show, the tv show itself, a movie, a short, or any other type of media that has a title! A single tv show will have both a `title_id` for the show itself, as well as a `title_id` for each individual episode.
-
-What if we wanted to get a list of episodes (_including_ the titles) for the show? Well, the _best_ way would probably be to use a _join_ statement -- but we are _just_ getting started, so we will skip that option (for now).
-
-Instead, we can use what is called a _subquery_. A _subquery_ is a query that is embedded inside another query. In this case, we are going to use a _subquery_ to find all the `episode_title_id` values for Breaking Bad, and use the `where` clause to filter our titles from our `titles` table where the `title_id` from the `titles` table is _in_ the result of our subquery.
-
-The following are some steps to help you figure this out.
-
-. Write a query that finds all the `episode_title_id` values for Breaking Bad.
-+
-[TIP]
-====
-We only need/want to keep the `episode_title_id` values, not the other fields like `show_title_id` or `season_number` or `episode_number`.
-====
-+
-. Once you have your query, use it as a _subquery_ to find all the `title_id` values for Breaking Bad.
-+
-[TIP]
-====
-Here is the general "form" for this.
-
-[source, sql]
-----
-SELECT _ FROM (SELECT _ FROM _ WHERE _) WHERE _;
-----
-
-Where the part surrounded by parentheses is the _subquery_.
-
-Of course, for this question, we just want to see if the `title_id` values are in the result of our subquery. For this, we can use the `in` operator.
-
-[source, sql]
-----
-SELECT _ FROM _ WHERE _ IN (SELECT _ FROM _ WHERE_);
-----
-====
-
-When done correctly, you should get a list of all of the `titles` table data for every episode in Breaking Bad, cool!
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-++++
-
-++++
-
-Okay, this _subquery_ thing is pretty useful, and a _little_ confusing. How about we practice some more?
-
-Just like in question (6), get a list of the ratings from the `ratings` table for every episode of Breaking Bad. Sort the results from highest to lowest by `rating`. What was the `title_id` of the episode with the highest rating? What was the rating?
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 8
-
-++++
-
-++++
-
-Write a query that finds a list of `person_id` values (and _just_ `person_id` values) for the episode of Breaking Bad with `title_id` of `tt2301451`. Use the `crew` table to do this. Limit your results to _actors_ only.
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 9
-
-++++
-
-++++
-
-Use the query from question (8) as a subquery to get the following output.
-
-----
-Name | Approximate Age
-----
-
-Use _aliases_ to rename the output. To calculate the approximate age, subtract the year the actor was born from 2021 -- that will be accurate for the majority of people.
-
-**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:aliasing.adoc[aliasing]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project10.adoc
deleted file mode 100644
index 3d7f679cf..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project10.adoc
+++ /dev/null
@@ -1,144 +0,0 @@
-= STAT 29000: Project 10 -- Fall 2021
-
-**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.
-
-**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`.
-
-**Scope:** SQL, sqlite
-
-.Learning Objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/taxi/taxi_sample.db`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-In project (8), you used bash tools, including `awk`, to parse through large amounts of yellow taxi data from `/depot/datamine/data/taxi/`. Of course, calculating things like the mean is not too difficult using `awk`, and `awk` _is_ extremely fast and efficient, BUT SQL is better for some of the work we attempted to do in project (8).
-
-Don't take my word for it! We've placed a sample of 5 of the data files for the yellow taxi cab into an SQLite database called `taxi_sample.db`. This database contains, among other things, the `yellow` table (for yellow taxi cab data).
-
-Write a query that will return the `fare_amount`, `surcharge`, `tip_amount`, and `tolls_amount` as a percentage of `total_amount`.
-
-Now, take into consideration that this query will be evaluating these percentages for 5 of the data files, not just the first file or so. Wow, impressive!
-
-[TIP]
-====
-Use the `sum` aggregate function to calculate the totals, and division to figure out the percentages.
-====
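-
-For instance, one of the four percentages might be computed like this sketch (the remaining columns follow the same pattern; the table and column names come from this question):
-
-[source, sql]
-----
-SELECT 100.0 * SUM(fare_amount) / SUM(total_amount) AS fare_pct FROM yellow;
-----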
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Check out the `payment_type` column. Write a query that counts the number of rows for each `payment_type`. The end result should print something like the following.
-
-.Output sample
-----
-payment_type, count
-CASH, 123
-----
-
-[TIP]
-====
-You can use aliasing to control the output header names.
-====
-
-Write a query that sums the `total_amount` for `payment_type` of "CASH". What is the total amount of cash payments?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query.
-
-Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, and explain what your query does and how it helps you understand what is going on.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Write a query that gets the average `total_amount` for each year in the database. Which year has the largest average `total_amount`? Use the `pickup_datetime` column to determine the year.
-
-[TIP]
-====
-Read https://www.sqlite.org/lang_datefunc.html[this] page and look at the strftime function.
-====
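-
-For example, `strftime` can pull the year out of the timestamp like this sketch (the column and table names come from this project's earlier questions):
-
-[source, sql]
-----
-SELECT strftime('%Y', pickup_datetime) AS year FROM yellow LIMIT 5;
-----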
-
-[TIP]
-====
-If you want the headers to be more descriptive, you can use aliases.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-What percent of the data in our database has information on the _location_ of pickup and dropoff? Examine the data to see if there is a pattern between the rows _with_ that information and those _without_ it.
-
-[TIP]
-====
-There _is_ a distinct pattern. Pay attention to the date and time of the data.
-====
-
-Confirm your hypothesis with the original data set(s) (in `/depot/datamine/data/taxi/yellow/*.csv`), using bash. This doesn't have to be anything more thorough than running a simple `head` command with a 1-2 sentence explanation.
-
-[TIP]
-====
-Of course, there will probably be some erroneous data in the latitude and longitude columns. However, you could use the `avg` function on a latitude or longitude column, by _year_, to look for a pattern.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project11.adoc
deleted file mode 100644
index 0adb5a950..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project11.adoc
+++ /dev/null
@@ -1,270 +0,0 @@
-= STAT 29000: Project 11 -- Fall 2021
-
-**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like `MIN`, `MAX`, and `AVG` in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
-
-**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values!
-
-**Scope:** SQL, SQL in R
-
-.Learning Objectives
-****
-- Demonstrate the ability to interact with popular database management systems within R.
-- Solve data-driven problems using a combination of SQL and R.
-- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
-- Showcase the ability to filter, alias, and write subqueries.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook.
-
-[source,ipython]
-----
-%load_ext sql
-%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db
-----
-
-The first command loads the sql extension. The second command connects to the database.
-
-For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells.
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Let's say we are interested in the Marvel Cinematic Universe (MCU). We could write the following query to get the titles of all the movies in the MCU (at least, available in our database).
-
-[source, sql]
-----
-SELECT premiered, COUNT(*) FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered;
-----
-
-The result would be a perfectly good-looking table. Now, with that being said, are the headers good-looking? I don't know about you, but `COUNT(*)` as a header is pretty bad looking. xref:book:SQL:aliasing.adoc[Aliasing] is a great way to not only make the headers look good, but it can also be used to reduce the text in a query by giving some intermediate results a shorter name.
-
-Fix the query so that the headers are `year` and `movie count`, respectively.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Okay, let's say we are interested in modifying our query from question (1) to get the _percentage_ of MCU movies released in each year. Essentially, we want the count for each group, divided by the total count of all the movies in the MCU.
-
-We can achieve this using a _subquery_. A subquery is a query nested inside another query; its result can then be used by the outer query (for example, as a value to divide by).
-
-Write a query that returns the total count of the movies in the MCU, and then use it as a subquery to get the percentage of MCU movies released in each year.
-
-[TIP]
-====
-You do _not_ need to change the query from question (1), rather, you just need to _add_ to the query.
-====
-
-[TIP]
-====
-You can directly divide `COUNT(*)` from the original query by the subquery to get the result!
-====
-
-[IMPORTANT]
-====
-Your initial result may seem _very_ wrong (no fractions at all!) -- this is OK, we will fix it in the next question.
-====
-
-[IMPORTANT]
-====
-Use aliasing to rename the new column to `percentage`.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Okay, if you did question (2) correctly, you should have got a result that looks a lot like:
-
-.Output
-----
-year,movie count,percentage
-2008, 2, 0
-2010, 1, 0
-2011, 2, 0
-...
-----
-
-What is going on?
-
-The `AS` keyword can _also_ be used to _cast_ types. Many programming languages have both an "integer" type -- for numeric data _without_ a decimal place -- and a "float" type -- for numeric data _with_ a decimal place. In _many_ languages, if you were to do the following, you'd get what _may_ be unexpected output.
-
-[source,c]
-----
-9/4
-----
-
-.Output
-----
-2
-----
-
-Since both of the values are integers, the result will truncate the decimal place. In other words, the result will be 2, instead of 2.25.
-
-In Python, they've made changes so this doesn't happen.
-
-[source,python]
-----
-9/4
-----
-
-.Output
-----
-2.25
-----
-
-However, if we want the "regular" functionality we can use the `//` operator.
-
-[source,python]
-----
-9//4
-----
-
-.Output
-----
-2
-----
-
-Okay, sqlite does this as well.
-
-[source, sql]
-----
-SELECT 9/4 as result;
-----
-
-.Output
-----
-result
-2
-----
-
-_This_ is why we are getting 0's for the percentage column!
-
-How do we fix this? The following is an example.
-
-[source, sql]
-----
-SELECT CAST(9 AS real)/4 as result;
-----
-
-.Output
-----
-result
-2.25
-----
-
-[NOTE]
-====
-Here, "real" represents "float" or "double" -- it is another way of saying a number with a decimal place.
-====
-
-[IMPORTANT]
-====
-When you do arithmetic with an integer and a real/float, the result will be a real/float.
-====
-
-Fix the query so that the results look something like:
-
-.Output
-----
-year, movie count, percentage
-2008, 2, 0.0689...
-2010, 1, 0.034482...
-2011, 2, 0.0689...
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-You now know 2 different applications of the `AS` keyword, and you also know how to use a query as a subquery, great!
-
-In the previous project, we were introduced to aggregate functions. We used the GROUP BY clause to group our results by the `premiered` column in this project too! We know we can use the `WHERE` clause to filter our results, but what if we wanted to filter our results based on an aggregated column?
-
-Modify our query from question (3) to print only the rows where the `movie count` is greater than 2.
-
-[TIP]
-====
-See https://www.geeksforgeeks.org/having-vs-where-clause-in-sql/[this article] for more information on the `HAVING` and `WHERE` clauses.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3.
-
-Look at the results. Which year had the lowest average number of words in the `primary_title` column (no need to write another query for this, just eyeball it)?
-
-[TIP]
-====
-See https://stackoverflow.com/questions/3293790/query-to-count-words-sqlite-3[here]. Replace "@String" with the column you want to count the words in.
-====
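-
-Concretely, the word-counting trick from that link translates to something like the following sketch (it assumes words in `primary_title` are separated by single spaces):
-
-[source, sql]
-----
-SELECT primary_title,
-       LENGTH(primary_title) - LENGTH(REPLACE(primary_title, ' ', '')) + 1 AS word_count
-FROM titles LIMIT 5;
-----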
-
-[TIP]
-====
-If you got it right, there should be 15 rows in the output.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project12.adoc
deleted file mode 100644
index a8fba9cd5..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project12.adoc
+++ /dev/null
@@ -1,143 +0,0 @@
-= STAT 29000: Project 12 -- Fall 2021
-
-**Motivation:** Databases are (usually) composed of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform "joins"! In this project we will learn about and practice using joins on our imdb database, as it has many tables where the benefit of joins is obvious.
-
-**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in a systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
-
-**Scope:** SQL, sqlite, joins
-
-.Learning Objectives
-****
-- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
-- Showcase the ability to filter, alias, and write subqueries.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook.
-
-[source,ipython]
-----
-%load_ext sql
-%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db
-----
-
-The first command loads the sql extension. The second command connects to the database.
-
-For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells.
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-In the previous project, we provided you with a query to get the number of MCU movies that premiered in each year.
-
-Now that we are learning about _joins_, we have the ability to make much more interesting queries!
-
-Use the provided list of `title_id` values to get a list of the MCU movie `primary_title` values, `premiered` values, and rating (from the provided list of MCU movies).
-
-Which movie had the highest rating? Modify your query to return only the 5 highest and 5 lowest rated movies (again, from the MCU list).
-
-.List of MCU title_ids
-----
-('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286')
-----
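-
-The exact solution is up to you, but purely as a hedged starting point, a join between the `titles` table and a ratings table could look roughly like the following. This sketch assumes (based on the database diagram above) that there is a `ratings` table with `title_id` and `rating` columns -- double check the table and column names against the diagram before using it.
-
-[source, sql]
-----
-%%sql
-
--- Sketch only: substitute the full list of MCU title_ids in the IN (...) clause.
--- Use ORDER BY r.rating ASC for the 5 lowest rated movies.
-SELECT t.primary_title, t.premiered, r.rating
-FROM titles AS t
-INNER JOIN ratings AS r ON t.title_id = r.title_id
-WHERE t.title_id IN ('tt0371746', 'tt4154796')
-ORDER BY r.rating DESC
-LIMIT 5;
-----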
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Run the following query.
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286');
-----
-
-Pay close attention to the movies in the output. You will notice there are movies presented in this query that are (likely) not in the query results you got for question (1).
-
-Write a query that returns the `primary_title` of those movies _not_ shown in the result of question (1) but that _are_ shown in the result of the query above. You can use the query in question (1) as a subquery to answer this.
-
-Can you notice a pattern to said movies?
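-
-If it helps to see the shape of such a query, the following hedged sketch (again assuming a `ratings` table, as in question (1)) uses the question (1) join as a subquery -- adapt it rather than copying it verbatim, and use the full list of MCU title_ids in both places.
-
-[source, sql]
-----
-%%sql
-
--- Sketch only: replace the shortened IN (...) lists with the full MCU list.
-SELECT primary_title FROM titles
-WHERE title_id IN ('tt0371746', 'tt4154796')
-AND title_id NOT IN (
-    SELECT t.title_id
-    FROM titles AS t
-    INNER JOIN ratings AS r ON t.title_id = r.title_id
-    WHERE t.title_id IN ('tt0371746', 'tt4154796')
-);
-----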
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-In the previous questions we explored what the difference between an INNER JOIN and a LEFT JOIN _actually_ is. It is likely you used an INNER JOIN/JOIN in your solution to question (1). As a result, the MCU movies that did not yet have a rating in IMDB are not shown in the output of question (1).
-
-Modify your query from question (1) so that it returns a list of _all_ MCU movies with their associated rating, regardless of whether or not the movie has a rating.
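-
-As a hedged hint, the only structural change from the question (1) sketch is the join type -- something along these lines (again assuming a `ratings` table with `title_id` and `rating` columns):
-
-[source, sql]
-----
-%%sql
-
--- Sketch only: a LEFT JOIN keeps every row from `titles`, filling `rating` with NULL
--- when there is no matching row in `ratings`. Substitute the full MCU list.
-SELECT t.primary_title, t.premiered, r.rating
-FROM titles AS t
-LEFT JOIN ratings AS r ON t.title_id = r.title_id
-WHERE t.title_id IN ('tt0371746', 'tt4154796');
-----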
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-In the previous project, question (5) asked you to write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3.
-
-Okay, great. It would be more interesting to see the average number of words in the `primary_title` column for titles with a rating of 8.5 or higher. Write a query to do that. How many words, on average, does a title with an 8.5 or higher rating have?
-
-Write another query that does the same for titles with < 8.5 rating. Is the average title length notably different?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-We have a fun database, and you've learned a new trick (joins). Use your newfound knowledge to write a query that uses joins to accomplish a task you couldn't previously (easily) tackle, and answers a question you are interested in.
-
-Explain what your query does, and talk about the results. Explain why you chose either a LEFT join or INNER join.
-
-.Items to submit
-====
-- A written question about the movies/tv shows in the database.
-- Code used to solve this problem.
-- Output from running the code.
-- Explanation of the results, what your query does, and why you chose either a LEFT or INNER join.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project13.adoc
deleted file mode 100644
index 2f668bb06..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project13.adoc
+++ /dev/null
@@ -1,325 +0,0 @@
-= STAT 29000: Project 13 -- Fall 2021
-
-**Motivation:** In the previous projects, you've gained experience writing all types of queries, touching on the majority of the main concepts. One critical task that we _haven't_ yet tackled is creating your _own_ database. While database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and _many_ of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets.
-
-**Context:** This is the final project for the semester, and we will be walking through the useful skill of creating a database and populating it with data. We will (mostly) be using the https://www.sqlite.org/[sqlite3] command line tool to interact with the database.
-
-**Scope:** sql, sqlite, unix
-
-.Learning Objectives
-****
-- Create a sqlite database schema.
-- Populate the database with data using `INSERT` statements.
-- Populate the database with data using the command line interface (CLI) for sqlite3.
-- Run queries on a database.
-- Create an index to speed up queries.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/flights/subset/2007.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-First things first: create a new Jupyter Notebook called `firstname-lastname-project13.ipynb`. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the `sqlite3` command line tool from the terminal window.
-
-Okay, once that is done, the first step is schema creation. Before we start, it is important to note the following. **The goal of this project is to put the data in `/depot/datamine/data/flights/subset/2007.csv` into a sqlite database we will call `firstname-lastname-project13.db`.**
-
-With that in mind, run the following (in your terminal) to get a sample of the data.
-
-[source,bash]
-----
-head /depot/datamine/data/flights/subset/2007.csv
-----
-
-You _should_ receive a result like:
-
-.Output
-----
-Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
-2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
-2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
-2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
-2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3
-2007,1,1,1,831,830,957,1000,WN,2278,N480,86,90,74,-3,1,SMF,PDX,479,3,9,0,,0,0,0,0,0,0
-2007,1,1,1,1430,1420,1553,1550,WN,2386,N611SW,83,90,74,3,10,SMF,PDX,479,2,7,0,,0,0,0,0,0,0
-2007,1,1,1,1936,1840,2217,2130,WN,409,N482,101,110,89,47,56,SMF,PHX,647,5,7,0,,0,46,0,0,0,1
-2007,1,1,1,944,935,1223,1225,WN,1131,N749SW,99,110,86,-2,9,SMF,PHX,647,4,9,0,,0,0,0,0,0,0
-2007,1,1,1,1537,1450,1819,1735,WN,1212,N451,102,105,90,44,47,SMF,PHX,647,5,7,0,,0,20,0,0,0,24
-----
-
-An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types -- just more "set in stone" than the very easily changed data.frame.
-
-Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the `sqlite_schema` table will contain the type, name, tbl_name, rootpage, and sql for the database object.
-
-[NOTE]
-====
-For sqlite, the "database object" could refer to a table, index, view, or trigger.
-====
-
-This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is https://www.sqlite.org/schematab.html[here].
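-
-If you'd like to peek at it yourself, the schema table can be queried like any other table. For example (in newer versions of sqlite the table is named `sqlite_schema`; older versions expose the same table as `sqlite_master`):
-
-[source, sql]
-----
-SELECT type, name, tbl_name FROM sqlite_schema LIMIT 5;
-----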
-
-For _our_ purposes, when I refer to "schema", what I _really_ mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the _exact_ commands to build the database from scratch _without_ the data itself. For example, take a look at our `imdb.db` database by running the following in your terminal.
-
-[source,bash]
-----
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load sqlite/3.36.0
-
-sqlite3 /depot/datamine/data/movies_and_tv/imdb.db
-----
-
-This will open the command line interface (CLI) for sqlite3. It will look similar to:
-
-[source,bash]
-----
-sqlite>
-----
-
-Type `.schema` to see the "schema" for the database.
-
-[NOTE]
-====
-Any command you run in the sqlite CLI that starts with a dot (`.`) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing `.help`.
-====
-
-After running `.schema`, you should see a variety of legitimate SQL commands that will create the structure of your database _without_ the data itself. This self-documenting behavior is extremely useful.
-
-Okay, great. Now, let's study the sample of our `2007.csv` dataset. Create a markdown list of key:value pairs for each column in the dataset. Each _key_ should be the title of the column, and each _value_ should be the _type_ of data that is stored in that column.
-
-For example:
-
-- Year: INTEGER
-
-Where the _value_ is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" https://www.sqlite.org/datatype3.html[here].
-
-Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are _approximate_ numeric data types and _exact_ numeric data types. What you are most familiar with is the _approximate_ numeric data types. In R or Python for example, try running the following:
-
-[source,r]
-----
-(3 - 2.9) <= 0.1
-----
-
-.Output
-----
-FALSE
-----
-
-[source,python]
-----
-(3 - 2.9) <= 0.1
-----
-
-.Output
-----
-False
-----
-
-Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is _critical_ that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for _our_ example, store text as TEXT, numbers _without_ decimal places as INTEGER, and numbers with decimal places as REAL -- our example dataset doesn't have a need for NUMERIC.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Okay, great! At this point in time you should have a list of key:value pairs with the column name and the data type, for each column. Now, let's put together our `CREATE TABLE` statement that will create our table in the database.
-
-See https://www.sqlitetutorial.net/sqlite-create-table/[here] for some good examples. Realize that the `CREATE TABLE` statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table `flights`.
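-
-To give you a rough idea of the shape (and only the shape -- yours must include _every_ column from the header, with the affinity types you chose in question (1)), a trimmed-down sketch might start like this:
-
-[source, sql]
-----
--- Partial sketch only: the remaining columns follow the same "name TYPE" pattern.
-CREATE TABLE flights (
-    Year INTEGER,
-    Month INTEGER,
-    DayofMonth INTEGER,
-    UniqueCarrier TEXT,
-    DepDelay INTEGER,
-    Origin TEXT,
-    Dest TEXT
-);
-----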
-
-Once you've written your `CREATE TABLE` statement, copy and paste it into the sqlite CLI. Upon success, you should see the statement printed when running the dot command `.schema`. Fantastic! You can also verify that the table exists by running the dot command `.tables`.
-
-Congratulations! To finish things off, please paste the `CREATE TABLE` statement into a markdown cell in your notebook.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-The next step in the project is to add the data! After all, it _is_ a _data_base.
-
-Inserting data into a table _is_ a bit cumbersome. For example, let's say we wanted to add the following row to our `flights` table.
-
-.Data to add
-----
-Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
-2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
-----
-
-The SQL way would be to run the following query.
-
-[source, sql]
-----
-INSERT INTO flights (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay) VALUES (2007,1,1,1,1232,1225,1341,1340,'WN',2891,'N351',69,75,54,1,7,'SMF','ONT',389,4,11,0,NULL,0,0,0,0,0,0);
-----
-
-NOT ideal -- especially since we have over 7 million rows to add! You could programmatically generate a `.sql` file with the `INSERT INTO` statement, hook the database up with Python or R and insert the data that way, _or_ you could use the wonderful dot commands sqlite already provides.
-
-You may find https://stackoverflow.com/questions/13587314/sqlite3-import-csv-exclude-skip-header[this post] very helpful.
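-
-As a rough sketch of the approach described in that post: switch the CLI into csv mode, import the file into your existing `flights` table, and then deal with the header row. Newer versions of the sqlite3 CLI (including the module loaded earlier) should also accept an `--skip 1` option on `.import` to skip the header entirely -- check `.help` if you want to try it.
-
-[source,bash]
-----
-.mode csv
-.import /depot/datamine/data/flights/subset/2007.csv flights
-
--- if the header line ended up in the table as a row, you can remove it like this
-DELETE FROM flights WHERE Year = 'Year';
-----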
-
-[WARNING]
-====
-You want to make sure you _don't_ end up with the header line included twice! You can verify whether that happened by running the following in the sqlite CLI.
-
-[source,bash]
-----
-.header on
-SELECT * FROM flights LIMIT 2;
-----
-
-The `.header on` dot command will print the header line for every query you run. If you have double entered the header line, it will appear twice. Once for the `.header on` and another time because that is the first row of your dataset.
-====
-
-Connect to your database in your Jupyter notebook and run a query to get the first 5 rows of your table.
-
-[TIP]
-====
-To connect to your database:
-
-[source,ipython]
-----
-%load_ext sql
-%sql sqlite:////home/PURDUEALIAS/flights.db
-----
-
-Assuming `flights.db` is in your home directory, and you change PURDUEALIAS to your alias, for example `mdw` for Dr. Ward or `kamstut` for Kevin.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-For this question, please take screenshots of your output from the terminal and add them to your notebook using a markdown cell. To do so, let's say you have an image called `my_image.png` in your $HOME directory. All you need to do is put the following in a markdown cell:
-
-[source,ipython]
-----
-![](/home/PURDUEALIAS/my_image.png)
-----
-
-Be sure to replace PURDUEALIAS with your alias.
-====
-
-Woohoo! You've successfully created a database and populated it with data from a dataset -- pretty cool! Now, run the following dot command in order to _time_ our queries: `.timer on`. This will print out the time it takes to run each query. For example, try the following:
-
-[source, sql]
-----
-SELECT * FROM flights LIMIT 5;
-----
-
-Cool! Time the following query.
-
-[source, sql]
-----
-SELECT * FROM flights ORDER BY DepTime LIMIT 1000;
-----
-
-.Output
-----
-Run Time: real 1.824 user 0.836007 sys 0.605384
-----
-
-That is pretty quick, but if (for some odd reason) there were going to be a lot of queries that searched on exact departure times, this could be a big waste of time when done at scale. What can we do to improve this? Add an index!
-
-Run the following query.
-
-[source, sql]
-----
-EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;
-----
-
-The output will indicate that the "plan" is to simply scan the entire table. This has a runtime of O(n), which means the time grows linearly with the number of rows in the table. If 1 million rows take 1 second, then 1 billion rows will take roughly 16 minutes! An _index_ is a data structure that lets us reduce the runtime to O(log(n)). With that scaling, if 1 million rows take 1 second, 1 billion rows would still take only a few seconds. _Much_ more efficient! So what is the catch here? Space.
-
-Leave the sqlite CLI by running `.quit`. Now, see how much space your `flights.db` file is using.
-
-[source,bash]
-----
-ls -lah $HOME/flights.db
-----
-
-.Output
-----
-571M
-----
-
-Okay, _after_ I add an index on the `DepTime` column, the file is now `653M` -- while that isn't a _huge_ difference, it would certainly be significant if we scaled up the size of our database. In this case, another drawback would be the insert time. Inserting new data into the database would force the database to have to _update_ the indexes. This can add a _lot_ of time. These are just tradeoffs to consider when you're working with a database.
-
-In this case, we don't care about the extra bit of space -- create an index on the `DepTime` column. https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-indexes-indexes-c4e175f3c346[This article] is a nice easy read that covers this in more detail.
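-
-Creating the index itself is a one-liner. As a minimal sketch (the index name is your choice):
-
-[source, sql]
-----
-CREATE INDEX idx_flights_deptime ON flights (DepTime);
-----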
-
-Great! Once you've created your index, run the following query.
-
-[source, sql]
-----
-SELECT * FROM flights ORDER BY DepTime LIMIT 1000;
-----
-
-.Output
-----
-Run Time: real 0.263 user 0.014261 sys 0.032923
-----
-
-Wow! That is some _serious_ improvement. What does the "plan" look like?
-
-[source, sql]
-----
-EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;
-----
-
-You'll notice the "plan" shows it will utilize the index to speed the query up. Great!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-We hope that this project has given you a small glimpse into the "other side" of databases. Now, write a query that uses one or more other columns. Time the query, then, create a _new_ index to speed the query up. Time the query _after_ creating the index. Did it work well?
-
-Document the steps of this problem just like you did for question (4).
-
-**Optional challenge:** Try to make your query utilize 2 columns and create an index on both columns to see if you can get a speedup.
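-
-As a purely hypothetical sketch of what the optional challenge could look like (using columns and values from the sample rows shown earlier -- pick whatever columns interest you):
-
-[source, sql]
-----
--- time this query before and after creating the composite index below
-SELECT * FROM flights WHERE Origin = 'SMF' AND Dest = 'PDX' LIMIT 100;
-
-CREATE INDEX idx_flights_origin_dest ON flights (Origin, Dest);
-----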
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-projects.adoc
deleted file mode 100644
index f129981b1..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-projects.adoc
+++ /dev/null
@@ -1,59 +0,0 @@
-= STAT 29000
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-* xref:fall2021/29000/29000-f2021-officehours.adoc[STAT 29000 Office Hours for Fall 2021]
-* xref:fall2021/29000/29000-f2021-project01.adoc[Project 1: Review: How to use Jupyter Lab]
-* xref:fall2021/29000/29000-f2021-project02.adoc[Project 2: Navigating UNIX: part I]
-* xref:fall2021/29000/29000-f2021-project03.adoc[Project 3: Navigating UNIX: part II]
-* xref:fall2021/29000/29000-f2021-project04.adoc[Project 4: Pattern matching in UNIX & R]
-* xref:fall2021/29000/29000-f2021-project05.adoc[Project 5: `awk` & bash scripts: part I]
-* xref:fall2021/29000/29000-f2021-project06.adoc[Project 6: `awk` & bash scripts: part II]
-* xref:fall2021/29000/29000-f2021-project07.adoc[Project 7: `awk` & bash scripts: part III]
-* xref:fall2021/29000/29000-f2021-project08.adoc[Project 8: `awk` & bash scripts: part IV]
-* xref:fall2021/29000/29000-f2021-project09.adoc[Project 9: SQL: part I -- Introduction to SQL]
-* xref:fall2021/29000/29000-f2021-project10.adoc[Project 10: SQL: part II -- SQL in R]
-* xref:fall2021/29000/29000-f2021-project11.adoc[Project 11: SQL: part III -- SQL comparison]
-* xref:fall2021/29000/29000-f2021-project12.adoc[Project 12: SQL: part IV -- Joins]
-* xref:fall2021/29000/29000-f2021-project13.adoc[Project 13: SQL: part V -- Review]
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. After submitting your project in Gradescope, you will be able to download the project to verify that the content you submitted is what the graders will see. You will **not** get credit for or be able to re-submit your work if you accidentally uploaded the wrong project, or anything else. It is your responsibility to ensure that you are uploading the correct content.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2021/stat29000
-
-=== Link
-
-https://piazza.com/purdue/fall2021/stat29000/home
-
-== Syllabus
-
-++++
-include::book:ROOT:partial$syllabus.adoc[]
-++++
-
-== Office hour schedule
-
-++++
-include::book:ROOT:partial$office-hour-schedule.adoc[]
-++++
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project01.adoc
deleted file mode 100644
index 3514f4a64..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project01.adoc
+++ /dev/null
@@ -1,207 +0,0 @@
-= STAT 39000: Project 1 -- Fall 2021
-
-== Mark~it~down, your first project back in The Data Mine
-
-**Motivation:** It's been a long summer! Last year, you got some exposure to command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python _with_ data. Topics will include things like: documentation using tools like sphinx, or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be wrestling with data the entire time.
-
-We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some fundamentals, and prepare for the rest of the semester.
-
-**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about a variety of useful and exciting topics.
-
-**Scope:** Jupyter Lab, R, Python, scholar, brown, markdown
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Scholar and Brown.
-- Review.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/`
-
-== Questions
-
-=== Question 1
-
-In previous semesters, we've used a program called RStudio Server to run R code on Scholar and solve the projects. This year, we will be using Jupyter Lab almost exclusively. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate and login to https://ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward.
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-In the not-too-distant future, we will be using _both_ Scholar (https://gateway.scholar.rcac.purdue.edu) _and_ Brown (https://ondemand.brown.rcac.purdue.edu) to launch Jupyter Lab instances. For now, however, we will be using Brown.
-====
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says btn:[Jupyter Lab], click on btn:[Jupyter Lab].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand uses SLURM to launch a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 MB of memory. It is OK to not understand what that means yet; we will learn more about this in STAT 39000. For the curious, however, if you were to open a terminal session in Scholar and/or Brown and run the following, you would see your job queued up.
-
-[source,bash]
-----
-squeue -u username # replace 'username' with your username
-----
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2021-s2022::
-The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-f2021-s2022-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2021-s2022 kernel. Click on btn:[f2021-s2022], and a fresh notebook will be created for you.
-
-Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node you are running on).
-====
-
-=== Question 2
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Scholar and Brown at `/depot/datamine/apps/templates/project_template.ipynb`).
-
-++++
-
-++++
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many of each types of cells are there in the default template?
-====
-
-=== Question 3
-
-Last year, while using RStudio, you probably gained a certain amount of experience using RMarkdown -- a flavor of Markdown that allows you to embed and run code in Markdown. Jupyter Lab, while very different in many ways, still uses Markdown to add formatted text to a given notebook. It is well worth the small time investment to learn how to use Markdown, and create a neat and reproducible document.
-
-++++
-
-++++
-
-Create a Markdown cell in your notebook. Create both an _ordered_ and _unordered_ list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another _ordered_ list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, **embolden** at least 1 item in your list, _italicize_ at least 1 item in your list, and make at least 1 item in your list formatted like `code`.
-
-[TIP]
-====
-You can quickly get started with Markdown using this cheat sheet: https://www.markdownguide.org/cheat-sheet/
-====
-
-[TIP]
-====
-Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered.
-====
-
-[TIP]
-====
-If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don't want to do that, feel free to use the `%%markdown` magic to create a markdown cell without _really_ creating a markdown cell. Any cell that starts with `%%markdown` in the first line will generate markdown when run.
-====
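-
-Purely as a sketch of what such a cell could contain (using the example interests mentioned above -- replace them with your own and add your own formatting):
-
-[source,ipython]
-----
-%%markdown
-My favorite academic interests:
-
-- **machine learning**
-- _operating systems_
-- `forensic accounting`
-
-Ranked from most to least interesting:
-
-1. machine learning
-2. operating systems
-3. forensic accounting
-----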
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum):
-
-- A header for this section (your choice of size) that says "About".
-+
-[TIP]
-====
-A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`.
-====
-+
-- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn.
-- In the about section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Read xref:templates.adoc[the templates page] and learn how to run snippets of code in Jupyter Lab _other than_ Python. Run at least 1 example of Python, R, SQL, and bash. For SQL and bash, you can use the following snippets of code to make sure things are working properly.
-
-++++
-
-++++
-
-[source, sql]
-----
--- Use the following sqlite database: /depot/datamine/data/movies_and_tv/imdb.db
-SELECT * FROM titles LIMIT 5;
-----
-
-[source,bash]
-----
-ls -la /depot/datamine/data/movies_and_tv/
-----
-
-For your R and Python code, use this as an opportunity to review your skills. For each language, choose at least 1 dataset from `/depot/datamine/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis, for each language.
-
-[TIP]
-====
-You could answer _any_ question you have about your dataset you want. This is an open question, just make sure you put in a good amount of effort. Low/no-effort solutions will not receive full credit.
-====
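-
-To give a sense of the expected scope (and only as a hedged sketch -- choose your own dataset and question, and note that this assumes `pandas` and `matplotlib` are available in the f2021-s2022 kernel), a Python answer might look something like:
-
-[source,python]
-----
-import pandas as pd
-import matplotlib.pyplot as plt
-
-def mean_dep_delay_by_carrier(df: pd.DataFrame) -> pd.Series:
-    """Return the average departure delay (in minutes) for each carrier."""
-    return df.groupby('UniqueCarrier')['DepDelay'].mean().sort_values()
-
-# read a manageable chunk of one of the datasets in /depot/datamine/data
-flights = pd.read_csv('/depot/datamine/data/flights/subset/2007.csv', nrows=100_000)
-
-delays = mean_dep_delay_by_carrier(flights)
-delays.plot(kind='bar', title='Average departure delay by carrier (sample of 2007)')
-plt.ylabel('minutes')
-plt.show()
-----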
-
-[IMPORTANT]
-====
-Once done, submit your projects just like last year. See the xref:submissions.adoc[submissions page] for more details.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentence analysis for each of your R and Python code examples.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project02.adoc
deleted file mode 100644
index 7412774e4..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project02.adoc
+++ /dev/null
@@ -1,305 +0,0 @@
-= STAT 39000: Project 2 -- Fall 2021
-
-== The (art?) of a docstring
-
-**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc].
-
-**Context:** This is the first project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
-
-**Scope:** Python, documentation
-
-.Learning Objectives
-****
-- Use Sphinx to document a set of Python code.
-- Use pdoc to document a set of Python code.
-- Write and use code that serializes and deserializes data.
-- Learn the pros and cons of various serialization formats.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/apple/health/watch_dump.xml`
-
-== Questions
-
-The topics of this semester are outlined in the xref:book:projects:39000-f2021-projects.adoc[39000 home page]. In addition to those topics, there will be a slight emphasis on topics related to working with APIs. Each project this semester will continue to be data-driven, and will be based on the provided dataset(s). The dataset listed for this project will be one that is revisited throughout the semester, as we will be slowly building out functions, modules, tests, documentation, etc., that will come together towards the end of the semester. Of course, all projects that expect any sort of previous work will provide you with previous work in case you choose to skip any given project.
-
-In this project we will work with pdoc to build some simple documentation, review some Python skills that may be rusty, and learn about serialization and deserialization of data -- a common component of many data science and computer science projects, and a key topic to understand when working with APIs.
-
-For the sake of clarity, this project will have more deliverables than the "standard" `.ipynb` notebook, `.py` file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question.
-
-[WARNING]
-====
-Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3.
-====
-
-=== Question 1
-
-Let's start by navigating to https://ondemand.brown.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the `.ipynb` file). Jupyter Lab is actually _much_ more useful. You can open terminals on Brown (the cluster), as well as open an editor for `.R` files, `.py` files, or any other text-based file.
-
-Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2021-s2022" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called `untitled.py`. Rename this file to `firstname-lastname-project02.py` (where `firstname` and `lastname` are your first and last name, respectively).
-
-[TIP]
-====
-Make sure you are in your `$HOME` directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file.
-====
-
-Read the https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings["3.8.2 Modules" section] of Google's Python Style Guide. Each individual `.py` file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why?
-
-[TIP]
-====
-Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the **serialized** data _looks_ like (if you tried to open it in a text editor, for example), or how it is _read_.
-====
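-
-If it helps to ground the terms before you write about them, here is a tiny illustration using JSON (just one of many possible formats, and not required for your answer):
-
-[source,python]
-----
-import json
-
-record = {'type': 'HeartRate', 'unit': 'count/min', 'value': 72}
-
-# serialization: an in-memory object becomes a string/bytes representation
-serialized = json.dumps(record)
-
-# deserialization: the representation becomes an equivalent in-memory object again
-restored = json.loads(serialized)
-assert restored == record
-----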
-
-Save your module.
-
-**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Now, in Jupyter Lab, open a new notebook using the "f2021-s2022" kernel (using the link:{attachmentsdir}/project_template.ipynb[course notebook template]).
-
-[TIP]
-====
-You can have _both_ the Python file _and_ the notebook open in separate Jupyter Lab tabs for easier navigation.
-====
-
-Fill in a code cell for question 1 with a Python comment.
-
-[source,python]
-----
-# See firstname-lastname-project02.py
-----
-
-For this question, read the xref:book:python:pdoc.adoc[pdoc section], and run a `bash` command to generate the documentation for your module that you created in the previous question, `firstname-lastname-project02.py`. So, everywhere in the example in the pdoc section where you see "mymodule.py" replace it with _your_ module's name -- `firstname-lastname-project02.py`.
-
-[TIP]
-====
-Use the `-o` flag to specify the output directory -- I would _suggest_ making it somewhere in your `$HOME` directory to avoid permissions issues.
-====
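-
-The exact invocation is covered in the pdoc section linked above, but as a rough sketch (paths and directory names here are just examples -- adjust them to your own setup):
-
-[source,bash]
-----
-cd $HOME
-python -m pdoc firstname-lastname-project02.py -o $HOME/docs
-----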
-
-Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called `firstname-lastname-project02.html`. To view this file in your browser, right click on the file, and select btn:[Open in New Browser Tab]. A new browser tab should open with your freshly made documentation. Pretty cool!
-
-[IMPORTANT]
-====
-Ignore the `index.html` file -- we are looking for the `firstname-lastname-project02.html` file.
-====
-
-[TIP]
-====
-You _may_ have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation.
-====
-
-[NOTE]
-====
-At this stage, you have the ability to create a PDF based on the generated webpage (but you do not yet need to do so). To do so, click on menu:File[Print...> Destination > Save to PDF]. This may vary slightly from browser to browser, but it should be fairly straightforward.
-====
-
-**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-[NOTE]
-====
-When I refer to "watch data" I just mean the dataset for this project.
-====
-
-Write a function called `get_records_for_date` that accepts an `lxml` etree (of our watch data, via `etree.parse`), and a `datetime.date`, and returns a list of Record Elements for a given date. Raise a `TypeError` if the date is not a `datetime.date`, or if the etree is not an `lxml.etree`.
-
-Use the https://google.github.io/styleguide/pyguide.html#383-functions-and-methods[Google Python Style Guide's "Functions and Methods" section] to write the docstring for this function. Be sure to include type annotations for the parameters and return value.
-
-Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read.
-
-Use the `-d` flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look?
-
-[TIP]
-====
-The following code should help get you started.
-
-[source,python]
-----
-import lxml.etree
-from datetime import datetime, date
-
-# read in the watch data
-tree = lxml.etree.parse('/depot/datamine/data/apple/health/watch_dump.xml')
-
-def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]:
- # docstring goes here
-
- # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not
-
- # test if `for_date` is a `datetime.date`, and raise TypeError if not
-
- # loop through the records in the watch data using the xpath expression `/HealthData/Record`
- # how to see a record, in case you want to
-    # print(lxml.etree.tostring(record))
-
- # test if the record's `startDate` is the same as `for_date`, and append to a list if it is
-
- # return the list of records
-
-# how to test this function
-chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
-my_records = get_records_for_date(tree, chosen_date)
-----
-====
-
-[TIP]
-====
-The following is some code that will be helpful to test the types.
-
-[source,python]
-----
-from datetime import datetime, date
-
-isinstance(some_date_object, date) # test if some_date_object is a date
-isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree
-----
-====
-
-[TIP]
-====
-To loop through records, you can use the `xpath` method.
-
-[source,python]
-----
-for record in tree.xpath('/HealthData/Record'):
- # do something with record
-----
-====
-
-Add this function to your `firstname-lastname-project02.py` file, and if you want, regenerate your new documentation that includes your new function.
-
-**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Great! Now, write a function called `to_msgpack`, that accepts an `lxml` Element, and an absolute path to the desired output file, checks to make sure it contains the following keys: `type`, `sourceVersion`, `unit`, and `value`, and encodes/serializes, then saves the result to the specified file.
-
-[TIP]
-====
-The following code should help get you started.
-
-[source,python]
-----
-import msgpack
-
-def to_msgpack(element: lxml.etree._Element, file: str) -> None:
- # docstring goes here
-
- # test if `file` is a `str`, and raise TypeError if not
-
- # test if `element` is a `lxml.etree._Element`, and raise TypeError if not
-
- # convert `element.attrib` into a dict
-
- # test if the dict contains the keys `type`, `sourceVersion`, `unit`, and `value`, and raise ValueError if not
-
- # remove "other" non-type/sourceVersion/unit/value keys from the dict
-
- # use msgpack library to serialize the dict to a msgpack file
-
-# how to use this function
-chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
-my_records = get_records_for_date(tree, chosen_date)
-to_msgpack(my_records[0], '$HOME/my_records.msgpack')
-----
-====
-
-[IMPORTANT]
-====
-`to_msgpack(my_records[0], '$HOME/my_records.msgpack')` may not work, depending on how you set up your function, you may need to use an absolute path like `to_msgpack(my_records[0], '/home/kamstut/my_records.msgpack')`.
-====
-
-Then, write a function called `from_msgpack`, that accepts an absolute path to a serialized file, and returns an `lxml` Element.
-
-[TIP]
-====
-The following code should help get you started.
-
-[source,python]
-----
-def from_msgpack(file: str) -> lxml.etree._Element:
- # docstring goes here
-
- # test if `file` is a `str`, and raise TypeError if not
-
- # deserialize the msgpack file into a dict
-
- # create new "Record" element
- e = etree.Element('Record')
-
- # loop through keys and values in the dict
- # and set the attributes of the new "Record" element
- # NOTE: This assumed the dict is called "d"
- for key, value in d.items():
- e.attrib[key] = str(value)
-
- # return the new "Record" element
-
-# how to use this function
-print(lxml.etree.tostring(from_msgpack('$HOME/my_records.msgpack')))
-----
-====
-
-[IMPORTANT]
-====
-`print(lxml.etree.tostring(from_msgpack('$HOME/my_records.msgpack')))` may not work, depending on how you set up your function, you may need to use an absolute path like `print(lxml.etree.tostring(from_msgpack('/home/kamstut/my_records.msgpack')))`.
-====
-
-Add these functions to your `firstname-lastname-project02.py` file, and regenerate your documentation. You should see some great looking documentation with your new functions.
-
-**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-This was _hopefully_ a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code.
-
-Finally, investigate the https://pdoc.dev/docs/pdoc.html[official pdoc documentation], and make at least 2 changes/customizations to your module. Some examples are below -- feel free to get creative and do something with pdoc outside of this list of options:
-
-- Modify the module so you do not need to pass the `-d` flag in order to let pdoc know that you are using Google-style docstrings.
-- Change the logo of the documentation to your own logo (or any logo you'd like).
-- Add some math formulas and change the output accordingly.
-- Edit and customize pdoc's jinja2 template (or CSS).
-
-**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project03.adoc
deleted file mode 100644
index 61e717f26..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project03.adoc
+++ /dev/null
@@ -1,617 +0,0 @@
-= STAT 39000: Project 3 -- Fall 2021
-
-== Thank yourself later and document now
-
-**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc].
-
-**Context:** This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
-
-**Scope:** Python, documentation
-
-.Learning Objectives
-****
-- Use Sphinx to document a set of Python code.
-- Use pdoc to document a set of Python code.
-- Write and use code that serializes and deserializes data.
-- Learn the pros and cons of various serialization formats.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/apple/health/watch_dump.xml`
-
-== Questions
-
-In this project, we are going to use the most popular Python documentation generation tool, Sphinx, to generate documentation for the module we created in project (2). If you chose to skip project (2), the module, in its entirety, will be posted, at the latest, this upcoming Monday. You do _not_ need your own copy of that module to complete this project. Your module from project (2) does not need to be perfect for this project.
-
-Last project was more challenging than intended. This project will provide a bit of a reprieve, and _should_ (hopefully) be fun to mess around with.
-
-.project_02_module.py
-[source,python]
-----
-"""This module is for project 2 for STAT 39000.
-
-**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case.
-
-**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form.
-
-The following are some common serialization formats:
-
-- JSON
-- Bincode
-- MessagePack
-- YAML
-- TOML
-- Pickle
-- BSON
-- CBOR
-- Parquet
-- XML
-- Protobuf
-
-**JSON:** One of the more widespread serialization formats, JSON has the advantages that it is human readable, and has an excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory.
-
-**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and remains impossible to work with in its serialized form.
-
-Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor.
-"""
-
-import lxml
-import lxml.etree
-import msgpack
-from lxml import etree
-from datetime import datetime, date
-
-
-def my_function(a, b):
- """
- >>> my_function(2, 3)
- 6
- >>> my_function('a', 3)
- 'aaa'
- >>> my_function(1, 3)
- 4
- """
- return a * b
-
-
-def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list:
- """
- Given an `lxml.etree` object and a `datetime.date` object, return a list of records
- with the startDate equal to `for_date`.
-
- Args:
- tree (lxml.etree): The watch_dump.xml file as an `lxml.etree` object.
- for_date (datetime.date): The date for which returned records should have a startDate equal to.
-
- Raises:
- TypeError: If `tree` is not an `lxml.etree` object.
- TypeError: If `for_date` is not a `datetime.date` object.
-
- Returns:
- list: A list of records with the startDate equal to `for_date`.
- """
-
- if not isinstance(tree, lxml.etree._ElementTree):
- raise TypeError('tree must be an lxml.etree')
-
- if not isinstance(for_date, date):
- raise TypeError('for_date must be a datetime.date')
-
- results = []
- for record in tree.xpath('/HealthData/Record'):
- if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date():
- results.append(record)
-
- return results
-
-
-def from_msgpack(file: str) -> lxml.etree._Element:
- """
-    Given the absolute path of a msgpack file, return the deserialized `lxml.Element` object.
-
- Args:
- file (str): The absolute path of the msgpack file to deserialize.
-
- Raises:
- TypeError: If `file` is not a `str`.
-
- Returns:
- lxml.Element: The deserialized `lxml.Element` object.
- """
-
- if not isinstance(file, str):
- raise TypeError('file must be a str')
-
- with open(file, 'rb') as f:
- d = msgpack.load(f)
-
- e = etree.Element('Record')
- for key, value in d.items():
- e.attrib[key] = str(value)
-
- return e
-
-
-def to_msgpack(element: lxml.etree._Element, file: str) -> None:
- """
- Given an `lxml.Element` object and a file path, serialize the `lxml.Element` object to
- a msgpack file at the given file path.
-
- Args:
- element (lxml.Element): The element to serialize.
-        file (str): The absolute path of the msgpack file to write to.
-
- Raises:
- TypeError: If `file` is not a `str`.
- TypeError: If `element` is not an `lxml.Element`.
-
- Returns:
- None: None
- """
-
- if not isinstance(file, str):
- raise TypeError('file must be a str')
-
- if not isinstance(element, lxml.etree._Element):
- raise TypeError('element must be an lxml.Element')
-
- # Test if `type`, `sourceVersion`, `unit`, and `value` are present in the element.
- d = dict(element.attrib)
- if not d.get('type') or not d.get('sourceVersion') or not d.get('unit') or not d.get('value'):
- raise ValueError('element must have all of the following keys: type, sourceVersion, unit, and value')
-
- # Remove "other" keys from the dict
- keys_to_remove = []
- for key in d.keys():
- if key not in ['type', 'sourceVersion', 'unit', 'value']:
- keys_to_remove.append(key)
-
- for key in keys_to_remove:
- del d[key]
-
- with open(file, 'wb') as f:
- msgpack.dump(d, f)
-
-if __name__ == '__main__':
- import doctest
- doctest.testmod()
-----
-
-=== Question 1
-
-[IMPORTANT]
-====
-Please use Firefox for this project. If you choose to use Chrome, the appearance of the documentation will be horrible. If you choose to use Chrome anyway, it is recommended that you change a setting in Chrome, temporarily, for this project, by typing (where you would normally put the URL):
-
-----
-chrome://flags
-----
-
-Then, search for "samesite". For "SameSite by default cookies", change from "Default" to "Disabled", and restart the browser.
-====
-
-- Create a new folder in your `$HOME` directory called `project3`.
-- Create a new Jupyter notebook in that folder called `project3.ipynb`, based on the normal project template.
-+
-[NOTE]
-====
-The majority of this notebook will consist of a single `bash` cell with the commands used to re-generate the documentation. This is okay, and by design. The main deliverable for this project will be the PDF of the documentation's HTML page.
-====
-+
-- Copy the code from project (2)'s `firstname-lastname-project02.py` module into the `$HOME/project3` directory. You can rename this file to `firstname_lastname_project03.py`.
-- In a `bash` cell in your Jupyter notebook, make sure you `cd` into the `project3` folder, and run the following command:
-+
-[source,bash]
-----
-python -m sphinx.cmd.quickstart ./docs -q -p project3 -a "Kevin Amstutz" -v 1.0.0 --sep
-----
-+
-[IMPORTANT]
-====
-Please replace "Kevin Amstutz" with your own name.
-====
-+
-[NOTE]
-====
-What do each of these arguments do? Check out https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html[this page of the official documentation].
-====
-
-You should be left with a newly created `docs` folder within your `project3` folder. Your structure should look something like the following.
-
-.project03 folder contents
-----
-project03<1>
-├── 39000_f2021_project03_solutions.ipynb<2>
-├── docs<3>
-│ ├── build <4>
-│ ├── make.bat
-│ ├── Makefile <5>
-│ └── source <6>
-│ ├── conf.py <7>
-│ ├── index.rst <8>
-│ ├── _static
-│ └── _templates
-└── kevin_amstutz_project03.py<9>
-
-5 directories, 6 files
-----
-
-<1> Our module (named `project03`) folder
-<2> Your project notebook (probably named something like `firstname_lastname_project03.ipynb`)
-<3> Your documentation folder
-<4> Your empty build folder where generated documentation will be stored
-<5> The Makefile used to run the commands that generate your documentation. Make the following changes:
-+
-[source,bash]
-----
-# replace
-SPHINXOPTS ?=
-SPHINXBUILD ?= sphinx-build
-SOURCEDIR = source
-BUILDDIR = build
-
-# with the following
-SPHINXOPTS ?=
-SPHINXBUILD ?= python -m sphinx.cmd.build
-SOURCEDIR = source
-BUILDDIR = build
-----
-+
-<6> Your source folder. This folder contains all hand-typed documentation.
-<7> Your conf.py file. This file contains the configuration for your documentation. Make the following changes:
-+
-[source,python]
-----
-# CHANGE THE FOLLOWING CONTENT FROM:
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-# import os
-# import sys
-# sys.path.insert(0, os.path.abspath('.'))
-
-# TO:
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-import os
-import sys
-sys.path.insert(0, os.path.abspath('../..'))
-----
-+
-<8> Your index.rst file. This file (and all files ending in `.rst`) is written in https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[reStructuredText] -- a Markdown-like syntax.
-<9> Your module. This is the module containing the code from the previous project, with nice, clean docstrings.
-
-Finally, with the modifications above having been made, run the following command in a `bash` cell in Jupyter notebook to generate your documentation.
-
-[source,bash]
-----
-cd $HOME/project3/docs
-make html
-----
-
-Once complete, your module's folder structure should look something like the following.
-
-.project03 folder contents
-----
-project03
-├── 39000_f2021_project03_solutions.ipynb
-├── docs
-│ ├── build
-│ │ ├── doctrees
-│ │ │ ├── environment.pickle
-│ │ │ └── index.doctree
-│ │ └── html
-│ │ ├── genindex.html
-│ │ ├── index.html
-│ │ ├── objects.inv
-│ │ ├── search.html
-│ │ ├── searchindex.js
-│ │ ├── _sources
-│ │ │ └── index.rst.txt
-│ │ └── _static
-│ │ ├── alabaster.css
-│ │ ├── basic.css
-│ │ ├── custom.css
-│ │ ├── doctools.js
-│ │ ├── documentation_options.js
-│ │ ├── file.png
-│ │ ├── jquery-3.5.1.js
-│ │ ├── jquery.js
-│ │ ├── language_data.js
-│ │ ├── minus.png
-│ │ ├── plus.png
-│ │ ├── pygments.css
-│ │ ├── searchtools.js
-│ │ ├── underscore-1.13.1.js
-│ │ └── underscore.js
-│ ├── make.bat
-│ ├── Makefile
-│ └── source
-│ ├── conf.py
-│ ├── index.rst
-│ ├── _static
-│ └── _templates
-└── kevin_amstutz_project03.py
-
-9 directories, 29 files
-----
-
-In the left-hand pane in the Jupyter Lab interface, navigate to `$HOME/project3/docs/build/html/`, and right click on the `index.html` file and choose btn:[Open in New Browser Tab]. You should now be able to see your documentation in a new tab.
-
-[IMPORTANT]
-====
-Make sure you are able to generate the documentation before you proceed; otherwise, you will not be able to continue to modify, regenerate, and view your documentation.
-====
-
-.Items to submit
-====
-- Code used to solve this problem (in 2 Jupyter `bash` cells).
-====
-
-=== Question 2
-
-One of the most important documents in any package or project is the README.md file. This file is so important that version control platforms like GitHub and GitLab automatically display it below the repository's contents. This file contains things like instructions on how to install the package, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like `numpy`, `pytorch`, or any other repository you've come across that you believe does a good job explaining the project.
-
-In the `docs/source` folder, create a new file called `README.rst`. Choose 3-5 of the following "types" of reStructuredText from https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[this webpage], and create a fake README. The content can be https://www.lipsum.com/[Lorem Ipsum]-style content as long as it demonstrates 3-5 of the types of reStructuredText.
-
-- Inline markup
-- Lists and quote-like blocks
-- Literal blocks
-- Doctest blocks
-- Tables
-- Hyperlinks
-- Sections
-- Field lists
-- Roles
-- Images
-- Footnotes
-- Citations
-- Etc.
-
-[IMPORTANT]
-====
-Make sure to include at least 1 https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections[section]. This counts as 1 of your 3-5.
-====
-
-Once complete, add a reference to your README in the `index.rst` file. To do so, open `index.rst` in an editor and add "README" as follows.
-
-.index.rst
-[source,rst]
-----
-.. project3 documentation master file, created by
- sphinx-quickstart on Wed Sep 1 09:38:12 2021.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-Welcome to project3's documentation!
-====================================
-
-.. toctree::
- :maxdepth: 2
- :caption: Contents:
-
- README
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
-----
-
-[IMPORTANT]
-====
-Make sure "README" is aligned with ":caption:" -- it should be 3 spaces from the left before the "R" in "README".
-====
-
-In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Pretty great!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Screenshot or PDF labeled "question02_results".
-====
-
-=== Question 3
-
-The `pdoc` package was specifically designed to generate documentation for Python modules using the docstrings _in_ the module. As you may have noticed, this is not "native" to Sphinx.
-
-Sphinx has https://www.sphinx-doc.org/en/master/usage/extensions/index.html[extensions]. One such extension is the https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html[autodoc] extension. This extension provides the same sort of functionality that `pdoc` provides natively.
-
-To use this extension, modify the `conf.py` file in the `docs/source` folder.
-
-[source,python]
-----
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
- 'sphinx.ext.autodoc'
-]
-----
-
-Next, update your `index.rst` file so autodoc knows which modules to extract data from.
-
-[source,rst]
-----
-.. project3 documentation master file, created by
- sphinx-quickstart on Wed Sep 1 09:38:12 2021.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-Welcome to project3's documentation!
-====================================
-
-.. automodule:: firstname_lastname_project03
- :members:
-
-.. toctree::
- :maxdepth: 2
- :caption: Contents:
-
- README
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
-----
-
-In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Not too bad!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Okay, while the documentation looks pretty good, clearly, Sphinx does _not_ recognize Google style docstrings. As you may have guessed, there is an extension for that.
-
-Add the `napoleon` extension to your `conf.py` file.
-
-[source,python]
-----
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
- 'sphinx.ext.autodoc',
- 'sphinx.ext.napoleon'
-]
-----
-
-In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Much better!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-[WARNING]
-====
-To make it explicitly clear what files to submit for this project:
-
-- `firstname_lastname_project03.py`
-- `firstname_lastname_project03.ipynb`
-- `firstname_lastname_project03.pdf` (result of exporting .ipynb to PDF)
-- `firstname_lastname_project03_webpage.pdf` (result of printing documentation webpage to PDF)
-====
-
-At this stage, you should have a pretty nice set of documentation, with really nice in-code documentation in the form of docstrings. However, there is still another "thing" to add to your docstrings that can take them to the next level.
-
-`doctest` is a standard library tool that allows you to include code, with expected output, _inside_ your docstring. Not only can this be nice for the user to see, but both `pdoc` and Sphinx apply special formatting to such additions to a docstring.
-
-Write a super simple function. It could be as simple as adding a couple of digits and returning a value. The following is an example. Come up with your own function with at least 1 passing test and 1 failing test (like the example).
-
-[source,python]
-----
-def add(value1, value2):
- """Function to add two values.
-
- The first example below will pass (because 1+1 is 2), the second will fail (because 1+2 is not 5)
-
- >>> add(1, 1)
- 2
-
- >>> add(1, 2)
- 5
- """
- return value1 + value2
-----
-
-Where ">>>" represents the Python REPL and code demonstrating how you would use the function, and the line immediately following is the expected output.
-
-[IMPORTANT]
-====
-Make sure your function actually does something so you can test to see if it is working as intended or not.
-====
-
-To use doctest, add the following to the bottom of your `firstname_lastname_project03.py` file.
-
-[source,python]
-----
-if __name__ == '__main__':
- import doctest
- doctest.testmod()
-----
-
-Now, in a new `bash` cell in your notebook, run the following command.
-
-[source,bash]
-----
-python kevin_amstutz_project03.py -v
-----
-
-This will actually run your example code in the docstring and compare the output to the expected result! Very cool. We will learn more about this in the next couple of projects.
-
-[NOTE]
-====
-When including the `-v` option, both passing _and_ failing tests will be printed. Without the `-v` option, only failing tests will be printed.
-====
-
-Now, regenerate your documentation again and check it out. Notice how the lines in the docstring are neatly formatted? Pretty great.
-
-Okay, last but not least, check out the themes https://sphinx-themes.org/[here], and choose one of the themes listed, regenerate your documentation, and save the webpage to a PDF for submission. Note that each theme may have slightly different requirements on how to "activate" it. For example, to use the "Readable" theme, you must add the following to your `conf.py` file.
-
-[source,python]
-----
-import sphinx_readable_theme
-html_theme = 'readable'
-html_theme_path = [sphinx_readable_theme.get_html_theme_path()]
-----
-
-[TIP]
-====
-You can change a theme by changing the value of `html_theme` in the `conf.py` file.
-====
-
-[TIP]
-====
-If a theme doesn't work, just select a different theme.
-====
-
-[TIP]
-====
-Unlike `pdoc` which only supports HTML output, Sphinx supports _many_ output formats, including PDF. If interested, feel free to use the following code to generate a PDF of your documentation.
-
-[source,bash]
-----
-module load texlive/20200406
-python -m sphinx.cmd.build -M latexpdf $HOME/project3/docs/source $HOME/project3/docs/build
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project04.adoc
deleted file mode 100644
index 5cc1b771b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project04.adoc
+++ /dev/null
@@ -1,334 +0,0 @@
-= STAT 39000: Project 4 -- Fall 2021
-
-== Write it. Test it. Change it. https://www.youtube.com/watch?v=7hPX_SresUM[Bop it?]
-
-**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have.
-
-**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data.
-
-**Scope:** Python, testing, pytest, mypy, doc tests
-
-.Learning Objectives
-****
-- Write and run unit tests using `pytest`.
-- Include and run doc tests in your docstrings, using `doctest`.
-- Gain familiarity with `mypy`, and explain why static type checking can be useful.
-- Comprehend what a function is, and the components of a function in Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/apple/health/2021/*`
-
-== Questions
-
-[WARNING]
-====
-At the end of this project, you will need to submit the following:
-
-- A notebook (`.ipynb`) with the output of running your tests, and any other code.
-- An updated `watch_data.py` file with the doctests you wrote, fixed function, and custom function.
-- A `test_watch_data.py` file with the `pytest` tests you wrote.
-====
-
-=== Question 1
-
-XPath expressions, while useful, have a very big limitation: the entire XML document must be read into memory. This is a problem for large XML documents. For example, parsing the `export.xml` file in the Apple Health data takes nearly 7GB of memory even though the file is only 980MB.
-
-.prof.py
-[source,python]
-----
-from memory_profiler import profile
-
-@profile
-def main():
- import lxml.etree
-
- tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml")
-
-if __name__ == '__main__':
- main()
-----
-
-[source,bash]
-----
-python -m memory_profiler prof.py
-----
-
-.Output
-----
-Filename: prof.py
-
-Line # Mem usage Increment Occurences Line Contents
-============================================================
- 3 36.5 MiB 36.5 MiB 1 @profile
- 4 def main():
- 5 38.5 MiB 2.0 MiB 1 import lxml.etree
- 6
- 7 6975.3 MiB 6936.8 MiB 1 tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml")
-----
-
-This is a _very_ common problem, not just for reading XML files, but for dealing with larger datasets in general. You will not always have an abundance of memory to work with.
-
-To get around this issue, you will notice that we take a _streaming_ approach, where only parts of the file are read into memory at a time, processed, and then freed.
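-
-To make the idea concrete, here is a minimal sketch of the streaming pattern (assuming `lxml` is installed and the `export.xml` path below exists). It is illustrative only, and is not part of the provided library.
-
-[source,python]
-----
-import lxml.etree
-
-# iterparse yields elements one at a time as they are parsed,
-# instead of loading the entire tree into memory at once
-record_count = 0
-for _, element in lxml.etree.iterparse("/depot/datamine/data/apple/health/2021/export.xml", tag="Record"):
-    record_count += 1
-    element.clear()  # free the memory used by this element before moving on
-
-print(record_count)
-----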
-
-Copy our library from `/depot/datamine/data/apple/health/apple_watch_parser` and import it into your code cell for question 1. Examine the code and test out at least 2 of the methods or functions.
-
-[TIP]
-====
-To copy the library run the following in a new cell.
-
-[source,ipython]
-----
-%%bash
-
-cp -r /depot/datamine/data/apple/health/apple_watch_parser $HOME
-----
-
-To import and use the library, make sure your notebook (let's say `my_notebook.ipynb`) is in the same directory (the `$HOME` directory) as the `apple_watch_parser` directory. Then, you can import and use the library as follows.
-
-[source,python]
-----
-from apple_watch_parser import watch_data
-
-dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021")
-print(dat)
-----
-====
-
-[TIP]
-====
-You may be asking yourself "well, what does that `dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021")` line do, other than let us print the `WatchData` object?" The answer: it lets you access and utilize the methods _within_ the `WatchData` object using dot notation. Any function _inside_ the `WatchData` class is called a _method_ and can be accessed using dot notation.
-
-For example, if we had a function called `my_function` that was declared _inside_ the `WatchData` class, we would call it as follows:
-
-[source,python]
-----
-from apple_watch_parser import watch_data
-
-dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021")
-dat.my_function(argument1, argument2)
-----
-
-Hopefully this is a good hint on how to use the dot notation to call methods in the `WatchData` class!
-====
-
-[TIP]
-====
-If you run `help(watch_data.time_difference)`, you will get some nice info about the function, including a note "Given two strings in the format matching the format in Apple Watch data: YYYY-MM-DD HH:MM:SS -XXXX". What does this mean? These are date/time format codes (see https://strftime.org/[here]).
-
-Let's say you have a string `2018-05-21 04:35:49 -0500`, and you want to convert it to a datetime object. To do so you would run the following.
-
-[source,python]
-----
-import datetime
-
-my_datetime_string = '2018-05-21 04:35:49 -0500'
-my_datetime = datetime.datetime.strptime(my_datetime_string, '%Y-%m-%d %H:%M:%S %z')
-----
-
-The string '%Y-%m-%d %H:%M:%S %z' is made up of format codes (see https://strftime.org/[here]). In order to convert from a string to a datetime object, you need to use a combination of format codes that _matches_ the format of the string. In this case, the string is '2018-05-21 04:35:49 -0500'. The "2018" part matches "%Y", the "05" part matches "%m", the "21" part matches "%d", the "04" part matches "%H", the "35" part matches "%M", the "49" part matches "%S", and the " -0500" part matches "%z". If your datetime string follows a different format, you would need to modify the combination of format codes so it matches your datetime string.
-
-Then, once you have a datetime object, you can do all sorts of fun things. The most obvious is converting the date back into a string, but formatted exactly how you want. For example, let's say we don't want a string with all the details '2018-05-21 04:35:49 -0500' has, and instead just want the month, day, and year, using forward slashes instead of hyphens.
-
-[source,python]
-----
-my_datetime.strftime('%m/%d/%Y') # '05/21/2018'
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem -- code that imports and uses our library and at least 2 of the methods or functions.
-- Output from running the code that uses 2 of the methods.
-====
-
-=== Question 2
-
-As you may have noticed, the code contains fairly thorough docstrings. This is a good thing, and it is a good goal to aim for when writing your own Python functions, classes, modules, etc.
-
-In the previous project, you got a small taste of using `doctest` to test your code using in-comment code. This is a great way to test parts of your code that are simple, straightforward, and don't involve extra data or _fixtures_ in order to test.
-
-Examine the code, and determine which functions and/or methods are good candidates for doctests. Modify the docstrings to include at least 3 doctests each, and run the following to test them out!
-
-Include the following doctest in the `calculate_speed` function. This does _not_ count as 1 of your 3 doctests for this function. It _will_ fail for this question -- that is okay!
-
-[source,python]
-----
->>> calculate_speed(5.0, .55, output_distance_unit = 'm')
-Traceback (most recent call last):
- ...
-ValueError: output_distance_unit must be 'mi' or 'km'
-----
-
-[IMPORTANT]
-====
-Make sure to include the expected output of each doctest below each line starting with `>>>`. This means in the code chunk shown above, you should include the "Traceback", "...", and "ValueError" lines as the expected output. Literally just copy and paste that entire code chunk into the `calculate_speed` docstring.
-====
-
-[source,ipython]
-----
-%%bash
-
-python $HOME/apple_watch_parser/watch_data.py -v
-----
-
-[TIP]
-====
-If you need to read in data or type a lot in order to use a function or method, a doctest is probably not the right approach. Hint, hint, try the functions rather than methods.
-====
-
-[TIP]
-====
-There are 2 _functions_ that are good candidates for doctests.
-====
-
-[TIP]
-====
-Don't forget to add the following code to the bottom of `watch_data.py` so doctests will run properly.
-
-[source,python]
-----
-if __name__ == '__main__':
- import doctest
- doctest.testmod()
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-In question 2, we wrote a doctest for the `calculate_speed` function. Figure out why the doctest fails, and make modifications to the function so it passes the doctest. Do _not_ modify the doctest.
-
-[CAUTION]
-====
-When you update the `calculate_speed` function, be sure to first save the `watch_data.py` file and then re-import the package so that way your modifications take effect.
-====
-
-[TIP]
-====
-Remember we want you to change the `calculate_speed` function to pass the doctest -- not change the doctest to make it pass.
-====
-
-[TIP]
-====
-The output of `calculate_speed(5.0, .55, output_distance_unit = 'm')` is `9.09090909090909`, but we _want_ it to be `ValueError: output_distance_unit must be 'mi' or 'km'` because 'm' isn't one of the two valid values, 'mi' or 'km'. Modify the `calculate_speed` function so it raises that error when the `output_distance_unit` parameter is not one of the two valid values.
-====
-
-[TIP]
-====
-Look carefully at the `_convert_distance` helper function -- that is where you will want to make modifications. Your logic within each `distance_unit` if statement should be along the lines of: "Is the `output_distance_unit` parameter 'mi'? If so, convert and/or return this distance. Is it 'km'? If so, convert and/or return this distance. Otherwise, raise an error because `output_distance_unit` should only be 'mi' or 'km'." A generic sketch of this pattern is shown directly after this tip.
-====
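-
-The following is a generic sketch of that guard-clause pattern. The function name, argument names, and conversion factor are illustrative only -- this is _not_ the actual `_convert_distance` implementation, just the shape of the check you need to add.
-
-[source,python]
-----
-def _convert(distance: float, output_distance_unit: str) -> float:
-    # hypothetical helper; assumes `distance` is given in miles
-    if output_distance_unit == 'mi':
-        return distance
-    elif output_distance_unit == 'km':
-        return distance * 1.609344
-    else:
-        # any value other than 'mi' or 'km' is rejected
-        raise ValueError("output_distance_unit must be 'mi' or 'km'")
-----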
-
-To run the doctest:
-
-[source,ipython]
-----
-%%bash
-
-python $HOME/apple_watch_parser/watch_data.py -v
-----
-
-This is what doctests are for! This helps you easily identify that something fundamental has changed and the code isn't ready for production. You can imagine a scenario where you run all doctests automatically before releasing a new product, and have that system notify you when a test fails -- very cool!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-While doctests are good for simple testing, a package like `pytest` is better. For the standalone functions, write at least 2 tests each using `pytest`. Make sure these tests test _different_ inputs than your doctests did -- it's not hard to come up with lots of tests!
-
-[NOTE]
-====
-This could end up being just 2 functions that run a total of 4 tests -- that is okay! As long as each function has at least 2 assert statements.
-====
-
-Start by adding a new file called `test_watch_data.py` to your `$HOME/apple_watch_parser` directory. Then, fill the file with your tests -- a minimal sketch of what such a file might look like is shown after the following cell. When ready to test, run the following in a new cell.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/apple_watch_parser
-python -m pytest
-----
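-
-For reference, a `test_watch_data.py` file might be structured like the sketch below. The inputs and expected values are placeholders -- verify them against the actual behavior of `calculate_speed` and `time_difference` before relying on them.
-
-[source,python]
-----
-import watch_data
-
-
-def test_calculate_speed():
-    # placeholder assertions -- replace with values you have verified by hand
-    assert watch_data.calculate_speed(5.0, .55) > 0
-    assert watch_data.calculate_speed(5.0, 1.0) > 0
-
-
-def test_time_difference():
-    # two timestamps exactly one day apart; assumes the result is a number of seconds
-    assert abs(watch_data.time_difference('2021-01-01 00:00:00 -0500', '2021-01-02 00:00:00 -0500')) == 86400
-----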
-
-[NOTE]
-====
-You may have noticed that we arbitrarily chose to place some functions _outside_ of our `WatchData` class, and others inside. There is no hard and fast rule to determine if a function belongs inside or outside of a class. In general, however, if a function is related to the class, and works with the attributes/data of the class, it should be inside the class. If the function has no relationship to the class, or could be useful with other types of data, it should be outside of the class.
-
-Of course, there are exceptions to this rule, and it is possible to write _static_ methods for a class, which operate independently of the class and its attributes (see the short illustration after this note). We chose to write the functions outside of the class more for demonstration purposes than anything else. They are functions that would most likely not be useful in any other context, but they demonstrate the concept and give us good functions for practicing writing doctests and `pytest` tests _without_ fixtures.
-====
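-
-As a quick illustration of that last point, a static method is declared with the `@staticmethod` decorator and does not receive `self`. The toy class below is illustrative only and is not part of the `apple_watch_parser` package.
-
-[source,python]
-----
-class Converter:
-
-    @staticmethod
-    def miles_to_km(miles: float) -> float:
-        # no `self` argument -- the method never touches instance data
-        return miles * 1.609344
-
-
-print(Converter.miles_to_km(5.0))  # 8.04672
-----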
-
-In the following project, we will continue to learn about `pytest`, including some more advanced features, like fixtures.
-
-**Relevant topics:** xref:book:python:pytest.adoc[pytest]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Explore the data -- there is a lot! Think of a function that could be useful for this module that would live _outside_ of the `WatchData` class. Write the function. Include Google style docstrings, doctests (at least 2), and `pytest` tests (at least 2, _different_ from your doctests). Re-run both your `doctest` tests and `pytest` tests.
-
-[NOTE]
-====
-You can simply add this function to your `watch_data.py` module, and run the tests just like you did for the previous questions!
-====
-
-[NOTE]
-====
-Your function doesn't _need_ to be useful for data outside the `WatchData` class (you won't lose credit if it isn't), but make an attempt! There are more types of elements and data to look at than just the `Workout` tags in the `export.xml` file. For example, there is GPX data (XML data that can be used to map a workout route) in the `/depot/datamine/data/apple/health/2021/workout-routes/` directory. Lots of options!
-====
-
-[TIP]
-====
-One way to peek around at the data (without having your notebook/kernel crash due to out of memory (OOM) errors) is something like the following:
-
-[source,python]
-----
-from lxml import etree
-
-tree = etree.iterparse("/depot/datamine/data/apple/health/2021/export.xml")
-ct = 0
-for event, element in tree:
- if element.tag == 'Workout':
- print(etree.tostring(element))
- ct += 1
- if ct > 100:
- break
- else:
- element.clear()
-
-# to extract an element's attributes
-element.attrib # dict-like object
-----
-====
-
-**Relevant topics:** xref:book:python:pytest.adoc[pytest], xref:book:data:html.adoc[html], xref:book:data:xml.adoc[xml]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project05.adoc
deleted file mode 100644
index aa4f5275e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project05.adoc
+++ /dev/null
@@ -1,362 +0,0 @@
-= STAT 39000: Project 5 -- Fall 2021
-
-== Testing Python: part II
-
-**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have.
-
-**Context:** This is the second in a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, and `mypy`, while writing code to manipulate and work with data.
-
-**Scope:** Python, testing, pytest, mypy
-
-.Learning Objectives
-****
-- Write and run unit tests using `pytest`.
-- Include and run doc tests in your docstrings, using `doctest`.
-- Gain familiarity with `mypy`, and explain why static type checking can be useful.
-- Comprehend what a function is, and the components of a function in Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/apple/health/2021/*`
-
-== Questions
-
-[WARNING]
-====
-At the end of this project, you will have only 3 files to submit:
-
-- `watch_data.py`
-- `test_watch_data.py`
-- `firstname-lastname-project05.ipynb`
-
-Make sure that the output from running the cells is displayed and saved in your notebook before submitting.
-====
-
-=== Question 1
-
-First, set up your workspace on Brown. Create a new folder called `project05` in your `$HOME` directory. Make sure your `apple_watch_parser` package from the previous project is in your `$HOME/project05` directory. In addition, create your new notebook, `firstname-lastname-project05.ipynb`, in your `$HOME/project05` directory. Great!
-
-[NOTE]
-====
-An updated `apple_watch_parser` package will be made available for this project on Saturday, September 25, in `/depot/datamine/data/apple/health/apple_watch_parser02`. To copy this over to your `$HOME/project05` directory, run the following in a terminal on Brown.
-
-[source,bash]
-----
-cp -R /depot/datamine/data/apple/health/apple_watch_parser02 $HOME/project05/apple_watch_parser
-----
-
-Note that this updated package is not required by any means to complete this project.
-====
-
-Now, in `$HOME/project05/apple_watch_parser/test_watch_data.py`, modify the `test_time_difference` function to use parametrization to test the time difference between 100 pairs of times.
-
-[TIP]
-====
-The following is an example of how to generate a list of 100 datetimes 1 day and 1 hour apart.
-
-[source,python]
-----
-import pytz
-import datetime
-
-start_time = datetime.datetime.now(pytz.utc)
-one_day = datetime.timedelta(days=1, hours=1)
-
-list_of_datetimes = [start_time+one_day*i for i in range(100)]
-----
-====
-
-[TIP]
-====
-An example of how to convert a datetime to a string in the same format our `time_difference` function expects is below.
-
-[source,python]
-----
-import pytz
-import datetime
-
-my_datetime = datetime.datetime.now(pytz.utc)
-my_string = my_datetime.strftime('%Y-%m-%d %H:%M:%S %z')
-----
-====
-
-[TIP]
-====
-See the very first example https://docs.pytest.org/en/6.2.x/parametrize.html[here] for how to parametrize a test for a function accepting 2 arguments instead of 1.
-====
-
-[TIP]
-====
-The `zip` function in Python will be particularly useful. Note, in the first example https://docs.pytest.org/en/6.2.x/parametrize.html[here], that the second argument to the `@pytest.mark.parametrize()` decorator is a list of tuples. `zip` accepts _n_ lists of _m_ elements, and returns _m_ tuples, where each tuple contains the elements of the lists in the same order.
-
-[source,python]
-----
-zip([1,2,3], [5,5,5], [9 for i in range(3)])
-----
-====
-
-[TIP]
-====
-You do _not_ need to manually calculate the expected result for each pair of datetimes that you pass to the `time_difference` function. Since you know exactly how many seconds you put between the datetimes you automatically generated, you can just use those values for the expected results. For example, if you generated 100 datetimes that are each 1 day apart, you know that the expected difference is 86400 seconds, and could pass a list of the value 86400 repeated 100 times as the `test_time_difference` function's third argument (the expected results). A sketch that puts these pieces together follows this tip.
-====
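-
-Putting the tips together, a parametrized test might be structured like the following sketch. It assumes, as the tip above suggests, that `time_difference` returns the (positive) number of seconds between the two timestamps -- adjust the expected values if the actual behavior differs.
-
-[source,python]
-----
-import datetime
-
-import pytest
-import pytz
-
-import watch_data
-
-_start = datetime.datetime.now(pytz.utc)
-_one_day = datetime.timedelta(days=1)
-
-# 100 pairs of timestamps, each pair exactly 1 day (86400 seconds) apart
-_times1 = [(_start + _one_day * i).strftime('%Y-%m-%d %H:%M:%S %z') for i in range(100)]
-_times2 = [(_start + _one_day * (i + 1)).strftime('%Y-%m-%d %H:%M:%S %z') for i in range(100)]
-_expected = [86400] * 100
-
-
-@pytest.mark.parametrize("time1,time2,expected", zip(_times1, _times2, _expected))
-def test_time_difference(time1, time2, expected):
-    assert watch_data.time_difference(time1, time2) == expected
-----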
-
-Run the `pytest` tests from a bash cell in your notebook.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project05
-python -m pytest
-----
-
-**Relevant topics:** xref:book:python:pytest.adoc#parametrizing-tests[parametrizing tests]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Read and understand the `filter_elements` method in the `WatchData` class. This is an example of a function that accepts another function as an argument. Functions that accept other functions as arguments are called _higher-order functions_. Think about the `filter_elements` method, and list at least 2 reasons _why_ this may be a good idea.
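-
-If passing a function to another function is a new idea, here is a tiny, self-contained illustration of a higher-order function (unrelated to the watch data).
-
-[source,python]
-----
-def keep_if(predicate, values):
-    # `keep_if` is a higher-order function: `predicate` is itself a function
-    return [value for value in values if predicate(value)]
-
-
-def is_even(number):
-    return number % 2 == 0
-
-
-print(keep_if(is_even, [1, 2, 3, 4, 5, 6]))  # [2, 4, 6]
-----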
-
-The docstring contains an explanation of the `filter_elements` method. In addition, the module provides a function called `example_filter` that can be used with the `filter_elements` method.
-
-Use the `example_filter` function to filter the `WatchData` class, and print the first 5 results.
-
-[TIP]
-====
-To import and use the package, make sure that the notebook is in the same directory as the `apple_watch_parser` package.
-
-[source,python]
-----
-from apple_watch_parser import watch_data
-
-dat = watch_data.WatchData('/depot/datamine/data/apple/health/2021/')
-print(dat)
-----
-====
-
-[TIP]
-====
-When passing a function as an argument to another function, you should _not_ include the opening and closing parentheses in the argument. For example, the following is _not_ correct.
-
-[source,python]
-----
-dat.filter_elements(example_filter())
-----
-
-Why? Because the `example_filter()` part will try to _evaluate_ the function and will essentially be translated into the output of running `example_filter()`, and we don't want it to. We want to pass the function itself, so that the `filter_elements` method can _use_ the `example_filter` function internally.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Write your own `*_filter` function in a Python code cell in your notebook (like `example_filter`) that can be used with the `filter_elements` method. Be sure to include a Google style docstring (no doctests are needed).
-
-Does it work as intended? Print the first 5 results when using your filter.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-In the previous project, we did _not_ test out the `filter_elements` method in our `WatchData` class. Testing this method is complicated for two main reasons.
-
-. The method accepts _any_ function following a set of rules (described in our docstring) as an argument. Such a `*_filter` function may not be immediately available after importing the `WatchData` class -- normally there wouldn't be an `example_filter` function in the module for you to use, as this would be a function that a user of the library would create for their own purposes.
-. In order to be able to test the `filter_elements` method, we would need a dataset that is structured similarly to the intended dataset (Apple Watch exports) and for which we _know_ the expected output, so we can test.
-
-`pytest` supports writing fixtures that can be used to solve these problems.
-
-To address problem (1):
-
-- Remove the `example_filter` function from the `watch_data.py` module, and instead add it to the `test_watch_data.py` module as a `pytest` fixture. Read https://docs.pytest.org/en/6.2.x/fixture.html#what-fixtures-are[this section] and the 2 or 3 following sections. In addition, see https://stackoverflow.com/a/44701916[this] stackoverflow answer to better understand how to create a fixture that is a function that can accept arguments.
-
-[CAUTION]
-====
-You may need to import lxml and other libraries in your `test_watch_data.py` file. For safety, you can just add the following.
-
-[source,python]
-----
-import watch_data
-import pytest
-from pathlib import Path
-import os
-import lxml.etree
-import pytz
-import datetime
-----
-====
-
-[NOTE]
-====
-Why do we need to do something like the stackoverflow post describes? The reason is that, by default, `pytest` will assume that the `element` argument to the `example_filter` function is itself a fixture, and things won't work! This is the workaround.
-====
-
-[TIP]
-====
-In the example in the https://stackoverflow.com/a/44701916[stackoverflow post], the `_method(a,b)` function is the equivalent of the `example_filter` function.
-
-As a side note, functions defined and used inside of another function are sometimes called _helper functions_, and it is good practice to name them starting with an underscore -- just like the `_method(a,b)` function in the stackoverflow post.
-====
-
-[TIP]
-====
-You can start by cutting the `example_filter` function from `watch_data.py` and pasting it into `test_watch_data.py`. Then, to make it a _fixture_, wrap it in another function just like in the https://stackoverflow.com/a/44701916[stackoverflow post].
-====
-
-To address problem (2):
-
-- Create a new `test_data` directory in your `apple_watch_parser` package. So, `$HOME/project05/apple_watch_parser/test_data` should now exist. Add `/depot/datamine/data/apple/health/2021/sample.xml` to this directory, and rename it to `export.xml`. So, `$HOME/project05/apple_watch_parser/test_data/export.xml` should now exist.
-+
-[NOTE]
-====
-`sample.xml` is a small sample of the watch data that we can use for our tests. It is small enough to be portable, yet similar enough to the intended types of datasets that it will be a good way to test our `WatchData` class and its methods. Since we renamed it to `export.xml`, it will work with our `WatchData` class.
-====
-+
-- Create a `test_filter_elements` function in your `test_watch_data.py` module. Use https://pypi.org/project/pytest-datafiles/[this] library (already installed) to properly copy the `test_data/export.xml` file to a temporary directory for the test. Examples 2 and 3 https://pypi.org/project/pytest-datafiles/[here] will be particularly helpful.
-+
-[NOTE]
-====
-You may be wondering _why_ we would want to use this library for our test rather than just hard-coding the path to our test files in our test function(s). The reason is the following. What if one of your functions had a side-effect that _modified_ your test data? Then, any other tests you run using the same data would be tainted and potentially fail! Bad news. This package allows for a systematic way to first copy our test data to a temporary location, and _then_ run our test using the data in that temporary location.
-
-In addition, if you have many test functions that work on the _same_ dataset, you can do something like the following to re-use the code over and over again.
-
-[source,python]
-----
-export_xml_decorator = pytest.mark.datafiles(...)
-
-@export_xml_decorator
-def test_1(datafiles):
- pass
-
-@export_xml_decorator
-def test_2(datafiles):
- pass
-----
-
-Each of the tests, `test_1` and `test_2`, will work on the same example dataset, but will have a fresh copy of the dataset each time. Very cool!
-====
-+
-[TIP]
-====
-The decorator, `@pytest.mark.datafiles()`, is expecting a path to the test data, `export.xml`. To get the absolute path to the test data, `$HOME/project05/apple_watch_parser/test_data/export.xml`, you can use the `pathlib` library.
-
-.test_watch_data.py
-[source,python]
-----
-import watch_data # since watch_data.py is in the same directory as test_watch_data.py, we can import it directly
-from pathlib import Path
-
-# To get the path of the watch_data Python module
-this_module_path = Path(watch_data.__file__).resolve().parent
-print(this_module_path) # $HOME/project05/apple_watch_parser
-
-# To get the test_data folders absolute path, we could then do
-print(this_module_path / 'test_data') # $HOME/project05/apple_watch_parser/test_data
-
-# To get the test_data/export.xml absolute path, we could then do ...?
-# HINT: The answer to this question is _exactly_ what should be passed to the `@pytest.mark.datafiles()` decorator.
-@pytest.mark.datafiles(answer_here)
-def test_filter_elements(datafiles, example_filter_fixture): # replace example_filter_fixture with the name of your fixture function
- pass
-----
-====
-
-Okay, great! Your `test_watch_data.py` module should now have 2 additional functions, "symbolically" something like this:
-
-[source,python]
-----
-# from https://stackoverflow.com/questions/44677426/can-i-pass-arguments-to-pytest-fixtures
-@pytest.fixture
-def my_fixture():
-
- def _method(a, b):
- return a*b
-
- return _method
-
-@pytest.mark.datafiles(answer_here)
-def test_filter_elements(datafiles, my_fixture):
- pass
-----
-
-Fill in the `test_filter_elements` function with at least 1 `assert` statement that tests the `filter_elements` method. It could be as simple as comparing the length of the output when using the `example_filter` function as our filter. `test_data/export.xml` should return 2 elements using our `example_filter` function as the filter.
-
-[TIP]
-====
-As a reminder, to run `pytest` from a bash cell in your notebook (which should be in the same directory as your `apple_watch_parser` directory, i.e. `$HOME/project05/firstname-lastname-project05.ipynb`), you can run the following.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project05
-python -m pytest
-----
-====
-
-[NOTE]
-====
-If you get an error that says pytest.mark.datafiles isn't defined, or something similar, do not worry -- it can be ignored. Alternatively, if you add a file called `pytest.ini` to your `$HOME/project05` directory, with the following contents, this warning will go away.
-
-.pytest.ini
-----
-[pytest]
-markers =
- datafiles: mark a test as a datafiles.
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Create an additional method in the `WatchData` class in the `watch_data.py` module that does something interesting or useful with the data. Be sure to include a Google style docstring (no doctests are needed). In addition, write 1 or more `pytest` tests for your new method that uses fixtures. Make sure your test passes (you can run your `pytest` tests from a `bash` cell in your notebook).
-
-If you are up for a bigger challenge, design your new method to be similar to `filter_elements`, in that a user can write their own functions or classes that can be passed to it (as arguments) in order to accomplish something useful that they _may_ want customized.
-
-[IMPORTANT]
-====
-We will count the use of the `@pytest.mark.datafiles()` decorator as a fixture, if you decide to not complete the "bigger challenge".
-====
-
-Make sure to run the `pytest` tests from a bash cell in your notebook.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project05
-python -m pytest
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project06.adoc
deleted file mode 100644
index bd34b2178..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project06.adoc
+++ /dev/null
@@ -1,374 +0,0 @@
-= STAT 39000: Project 6 -- Fall 2021
-
-== Sharing Python code: Virtual environments & git part I
-
-**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program or script) and program speed (how fast your code runs). This is often the best choice, depending on your staff and how much your software developers or data scientists earn. However, Python code does _not_ have the advantage of being compiled to machine code for a certain architecture (x86_64, ARM, etc.) and easily shared as a single binary. In Python, you need to learn how to use virtual environments (and git) to share your code.
-
-**Context:** This is the first in a series of 3 projects that explores how to set up and use virtual environments, as well as some `git` basics. This series is not intended to teach you everything you need to know, but rather to give you some exposure so the terminology and general ideas are not foreign to you.
-
-**Scope:** Python, virtual environments, git
-
-.Learning Objectives
-****
-- Explain what a virtual environment is and why it is important.
-- Create, update, and use a virtual environment to run somebody else's Python code.
-- Use git to create a repository and commit changes to it.
-- Understand and utilize common `git` commands and workflows.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Questions
-
-[NOTE]
-====
-While this project may _look_ like it is a lot of work, it is probably one of the easiest projects you will get this semester. The question text is long, but it is mostly just instructional content and directions. If you just carefully read through it, it will probably take you well under 1 hour to complete!
-====
-
-=== Question 1
-
-Sign up for a free GitHub account at https://github.com[https://github.com]. If you already have a GitHub account, perfect!
-
-Once complete, type your GitHub username into a markdown cell.
-
-.Items to submit
-====
-- Your GitHub username in a markdown cell.
-====
-
-=== Question 2
-
-We've created a repository for this project at https://github.com/TheDataMine/f2021-stat39000-project6. You'll quickly see that the code will be ultra familiar to you. The goal of this question is to xref:book:git:git.adoc#clone[clone] the repository to your `$HOME` directory. Some of you may already be rushing off to your Jupyter Notebook to run the following.
-
-[source,ipython]
-----
-%%bash
-
-git clone https://github.com/TheDataMine/f2021-stat39000-project6
-----
-
-Don't! Instead, we are going to take the time to set up authentication with GitHub using SSH keys. Don't worry, it's _way_ easier than it sounds!
-
-[NOTE]
-====
-P.S. As usual, you should have a notebook called `firstname-lastname-project06.ipynb` (or something similar) in your `$HOME` directory, and you should be using `bash` cells to run and track your `bash` code.
-====
-
-The first step is to create a new SSH key pair on Brown, in your `$HOME` directory. To do that, simply run the following in a bash cell.
-
-[IMPORTANT]
-====
-If you know what an SSH key pair is, and already have one setup on Brown, you can skip this step.
-====
-
-[source,ipython]
-----
-%%bash
-
-ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 -C "lastname_brown_key"
-----
-
-When prompted for a passphrase, just press enter twice _without_ entering a passphrase. If it doesn't prompt you, it probably already generated your keys! Congratulations! You have your new key pair!
-
-So, what is a key pair, and what does it look like? A key pair is two files on your computer (or in this case, Brown). These files live inside the following directory `~/.ssh`. Take a look by running the following in a bash cell.
-
-[source,bash]
-----
-ls -la ~/.ssh
-----
-
-.Output
-----
-...
-id_ed25519
-id_ed25519.pub
-...
-----
-
-The first file, `id_ed25519`, is your _private_ key. It is critical that you do not share this key with anybody, ever. Anybody in possession of this key can log in, as _you_, to any system with the associated _public_ key. As such, on a shared system (with lots of users, like Brown), it is critical to assign the correct permissions to this file. Run the following in a bash cell.
-
-[source,bash]
-----
-chmod 600 ~/.ssh/id_ed25519
-----
-
-This will ensure that you, as the _owner_ of the file, have the ability to both read and write to this file. At the same time, this prevents any other user from being able to read, write, or execute this file (with the exception of a superuser). It is also important to get the permissions of files within `~/.ssh` correct, as `openssh` will not work properly otherwise (for safety).
-
-Great! The other file, `id_ed25519.pub`, is your _public_ key. This is the key that is shareable, and that allows a third party to verify that "the user trying to access resource X has the associated _private_ key." First, let's set the correct permissions by running the following in a bash cell.
-
-[source,bash]
-----
-chmod 644 ~/.ssh/id_ed25519.pub
-----
-
-This will ensure that you, as the _owner_ of the file, have the ability to both read and write to this file. At the same time, everybody else on the system will only be able to read it.
-
-Last, but not least run the following to correctly set the permission of the `~/.ssh` directory.
-
-[source,ipython]
-----
-%%bash
-
-chmod 700 ~/.ssh
-----
-
-Now, take a look at the contents of your _public_ key by running the following in a bash cell.
-
-[source,ipython]
-----
-%%bash
-
-cat ~/.ssh/id_ed25519.pub
-----
-
-Not a whole lot to it, right? Great. Copy the output (the contents of the file) to your clipboard. Now, navigate and log in to https://github.com if you haven't already. Click on your profile in the upper-right-hand corner of the screen, and then click btn:[Settings].
-
-[NOTE]
-====
-If you haven't already, this is a fine time to explore the various GitHub settings, set a profile picture, add a bio, etc.
-====
-
-In the left-hand menu, click on btn:[SSH and GPG keys].
-
-In the next screen, click on the green button that says btn:[New SSH key]. Fill in the "Title" field with anything memorable. I like to put a description that tells me where I generated the key (on what computer), for example, "brown.rcac.purdue.edu". That way, I can know if I can delete that key down the road when cleaning things out. In the "Key" field, paste your public key (the output from running the `cat` command in the previous code block). Finally, click the button that says btn:[Add SSH key].
-
-Congratulations! You should now be able to easily authenticate with GitHub from Brown -- how cool! To test the connection, run the following in a cell.
-
-[source,ipython]
-----
-!ssh -o "StrictHostKeyChecking no" -T git@github.com
-----
-
-[NOTE]
-====
-If you use the following -- you will get an error, but as long as it says "Hi username! ..." at the top, you are good to go!
-
-[source,ipython]
-----
-%%bash
-
-ssh -T git@github.com
-----
-====
-
-If you were successful, it should reply with something like:
-
-----
-Hi username! You've successfully authenticated, but GitHub does not provide shell access.
-----
-
-[NOTE]
-====
-If it asks you something like "Are you sure you want to continue connecting (yes/no)?", type "yes" and press enter.
-====
-
-Okay, FINALLY, let's get to the actual task! Clone the repository to your `$HOME` directory, using SSH rather than HTTPS.
-
-[TIP]
-====
-If you navigate to the repository in the browser and click on the green "<> Code" button, you will get a dropdown menu that allows you to select "SSH", which will then present you with the string you can use in combination with the `git clone` command to clone the repository.
-====
-
-Upon success, you should see a new folder in your `$HOME` directory, `f2021-stat39000-project6`.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Take a peek into your freshly cloned repository. You'll notice a couple of files that you may not recognize. Focus on the `pyproject.toml` file, and `cat` it to see the contents.
-
-The `pyproject.toml` file contains the build system requirements of a given Python project. It can be used with `pip` or some other package installer to download the _exact_ versions of the _exact_ packages (like `pandas`, for example) required in order to build and/or run the project!
-
-Typically, when you are working on a project, and you've cloned the project, you want to build the exact environment that the developer had set up when developing the project. This way you ensure that you are using the exact same versions of the same packages, so you can expect things to function the same way. This is _critical_, as the _last_ thing you want to deal with is figuring out _why_ your code is not working when the developer's or project maintainer's _is_.
-
-There are a variety of popular tools that can be used for dependency management and/or virtual environment management in Python. The most popular are: https://docs.conda.io/en/latest/[conda], https://pipenv.pypa.io/en/latest/[pipenv], and https://python-poetry.org/[poetry].
-
-[NOTE]
-====
-What is a "virtual environment"? In a nutshell, a virtual environment is a Python installation such that the interpreter, libraries, and scripts that are available in the virtual environment are distinct and separate from those in _other_ virtual environments or the _system_ Python installation.
-
-We will dig into this more.
-====
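-
-If you're curious, a quick way to check whether you are currently inside a virtual environment is to compare `sys.prefix` (the active environment) with `sys.base_prefix` (the base installation). This is a general Python 3 illustration, not something specific to our setup.
-
-[source,python]
-----
-import sys
-
-# inside a virtual environment, these two paths differ;
-# in the system Python installation, they are the same
-print(sys.prefix)
-print(sys.base_prefix)
-print(sys.prefix != sys.base_prefix)  # True when running inside a virtual environment
-----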
-
-There are pros and cons to each of these tools, and you are free to explore and use what you like. Having used each of these tools exclusively for at least a year, I have had the fewest issues with poetry.
-
-[NOTE]
-====
-When I say "issues" here, I mean unresolved bugs with open tickets on the project's GitHub page. For that reason, we will be using poetry for this project.
-====
-
-Poetry was used to create the `pyproject.toml` file you see in the repository. Poetry is already installed on Brown. See where it is installed by running the following in a bash cell.
-
-[source,bash]
-----
-which poetry
-----
-
-By default, when creating a virtual environment using poetry, each virtual environment will be saved to `$HOME/.cache/pypoetry`. While this is not particularly bad, there is a configuration option we can set that will instead store the virtual environment in a project's own directory. This is a nice feature if you are working in a shared compute space, as it is explicitly clear where the environment is located, and, theoretically, you will have access to it (as it is a shared space). Let's set this up. Run the following command.
-
-[source,ipython]
-----
-%%bash
-
-poetry config virtualenvs.in-project true
-poetry config cache-dir "$HOME/.cache/pypoetry"
-poetry config --list
-----
-
-This will create a `config.toml` file at `$HOME/.config/pypoetry/config.toml`, which is where your settings are saved.
-
-Finally, let's set up your _own_ virtual environment to use with your cloned `f2021-stat39000-project6` repository. Run the following commands.
-
-[source,bash]
-----
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project6
-poetry install
-----
-
-[IMPORTANT]
-====
-This may take a minute or two to run.
-====
-
-[NOTE]
-====
-Normally, you'd be able to skip the `module unload` part of the command; however, this is required since we are already _in_ a virtual environment (the f2021-s2022 kernel). Otherwise, poetry would not install the packages into the correct location.
-====
-
-This should install all of the dependencies and the virtual environment in `$HOME/f2021-stat39000-project6/.venv`. To check, run the following.
-
-[source,bash]
-----
-ls -la $HOME/f2021-stat39000-project6/
-----
-
-To actually _use_ this virtual environment (rather than our kernel's Python environment, or the _system_ Python installation), preface `python` commands with `poetry run`. For example, let's say we want to run a script in the package. Instead of running `python script.py`, we can run `poetry run python script.py`. Test it out!
-
-[WARNING]
-====
-For each bash cell in which you run poetry commands, it is critical that the cell begins as follows:
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-----
-
-Otherwise, poetry will not use the correct Python environment. This is a side effect of the way our installation is set up; normally, poetry will know to use the correct Python environment for the project.
-====
-
-We have a file called `runme.py` in the `scripts` directory (`$HOME/f2021-stat39000-project6/scripts/runme.py`). This script just quickly uses our package and prints some info -- nothing special. Run the script using the virtual environment.
-
-[IMPORTANT]
-====
-You may need to provide execute permissions to the runme files.
-
-[source,bash]
-----
-chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py
-chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py
-----
-====
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py
-chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py
-cd $HOME/f2021-stat39000-project6
-poetry run python scripts/runme.py
-----
-
-[TIP]
-====
-The script will print the location of the `pandas` package as well -- if it starts with `$HOME/f2021-stat39000-project6/.venv/` then you are correctly running the script using our environment! Otherwise, you are not and need to remember to use poetry.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Now, try to run the following script using our virtual environment: `$HOME/f2021-stat39000-project6/scripts/runme2.py`. What happens?
-
-[IMPORTANT]
-====
-Make sure to run the script from the project folder and _not_ from the `$HOME` directory. `poetry` looks for a `pyproject.toml` file in the current directory, and if it doesn't find one, it will throw an error -- but that error will not show you which package is missing. So, to be clear, don't do:
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-poetry run python $HOME/f2021-stat39000-project6/scripts/runme2.py
-----
-
-But _do_ run:
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project6
-poetry run python scripts/runme2.py
-----
-====
-
-It looks like a package wasn't found, and should be added to our environment (and therefore our `pyproject.toml` file). Run the following command to install the package to your virtual environment.
-
-[source,bash]
-----
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project6
-poetry add packagename # where packagename is the name of the package/module you want to install (that was found to be missing)
-----
-
-Does the `pyproject.toml` reflect this change? Now try and run the script again -- voila!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Read about at least 1 of the 2 git workflows listed xref:book:git:workflows.adoc[here] (if you have to choose 1, I prefer the "GitHub flow" style). Describe in words the process you would use to add a function or method to our repo, step by step, in as much detail as you can. I will start for you, with the "GitHub flow" style.
-
-. Add the function or method to the `watch_data.py` module in `$HOME/f2021-stat39000-project6/`.
-. ...
-. Deploy the branch (this could be a website, or a package being used somewhere) for final testing, before merging into the `main` branch, where code should be pristine, able to be immediately deployed at any time, and function as intended.
-. ...
-
-[TIP]
-====
-The goal of this question is to try as hard as you can to understand, at a high level, what a workflow like this enables and the steps involved, and to think about it from the perspective of working with 100 other data scientists and/or software engineers. Any details, logic, or explanation you want to provide in the steps would be excellent!
-====
-
-[TIP]
-====
-You do _not_ need to specify actual `git` commands if you do not feel comfortable doing so, however, it may come in handy in the next project (_hint hint_).
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project07.adoc
deleted file mode 100644
index 42c7cdb91..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project07.adoc
+++ /dev/null
@@ -1,430 +0,0 @@
-= STAT 39000: Project 7 -- Fall 2021
-
-**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program or script) and program speed (how fast your code runs). This is often the right tradeoff, depending on your staffing and how much your software developers or data scientists earn. However, Python code does _not_ have the advantage of being compiled to machine code for a certain architecture (x86_64, ARM, Darwin, etc.) and easily shared. In Python, you need to learn how to use virtual environments (and git) to share your code.
-
-**Context:** This is the second in a series of 3 projects that explores how to setup and use virtual environments, as well as some `git` basics. This series is not intended to teach you everything you need to know, but rather to give you some exposure so the terminology and general ideas are not foreign to you.
-
-**Scope:** Python, virtual environments, git
-
-.Learning Objectives
-****
-- Explain what a virtual environment is and why it is important.
-- Create, update, and use a virtual environment to run somebody else's Python code.
-- Use git to create a repository and commit changes to it.
-- Understand and utilize common `git` commands and workflows.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-== Questions
-
-[NOTE]
-====
-This will be another project light on code and data, which we will reintroduce in the next project. While the watch data is a pretty great dataset, I realize that perhaps its format is a distraction from the goal of the project, and not something you want to be fighting with as we begin a series on using and writing APIs. We will begin to transition away from the watch data in this project, and instead use movie/tv-related data, which will be a _lot_ more fun to write an API for (hopefully).
-====
-
-=== Question 1
-
-[CAUTION]
-====
-If you did not complete project (6), you should go back and complete question (1) and question (2) before continuing. Don't worry, you just need to follow the instructions; there is no critical thinking for those 2 questions. If you get stuck, just write in Piazza.
-====
-
-As alluded to in question (5) from the previous project, in this project, we will put to work what you learned from the previous project!
-
-First, if you haven't already, create a `firstname-lastname-project07.ipynb` file in your `$HOME` directory.
-
-Review and read the content https://guides.github.com/introduction/flow/[here] on GitHub flow. GitHub flow is a workflow or pattern that you can follow that will help you work on the same codebase, with many others, at the same time. You may see where this is going, and it is a little crazy, but let's give this a try and see what happens.
-
-In this project, we will be "collaborating" with every other 39000 student, but mostly with me. Normally, you would all have explicit permissions in GitHub to work on and collaborate on the repositories in a given organization. For example, if I added you all to TheDataMine GitHub organization, you could simply clone the repository, create your own branch, make modifications, and push the branch up to GitHub. Unfortunately, since _technically_ you aren't in TheDataMine GitHub organization, you can't do that. Instead, you need to _fork_ our repository, clone your fork of the repository, create your own branch, make modifications, and push the branch up to GitHub. Just follow the instructions provided and it will be fine!
-
-Start by forking our repository. In a browser, navigate to https://github.com/TheDataMine/f2021-stat39000-project7, and in the upper right-hand corner, click the "Fork" button.
-
-[IMPORTANT]
-====
-Make sure you are logged in to GitHub before you fork the repository!
-====
-
-image::figure15.webp[Fork the repository, width=792, height=500, loading=lazy, title="Fork the repository"]
-
-This will create a _fork_ of our original repository in _your_ GitHub account. Now, we want to clone _your_ fork of the repo!
-
-Clone your fork into your `$HOME` directory:
-
-- YourUserName/f2021-stat39000-project7
-
-[IMPORTANT]
-====
-Replace "YourUserName" with your GitHub username.
-====
-
-[NOTE]
-====
-Sometimes, repositories will be shown as GitHubOrgName/RepositoryName or GitHubUserName/RepositoryName. The repos will be located at https://github.com/GitHubOrgName/RepositoryName and https://github.com/GitHubUserName/RepositoryName, respectively. When using SSH (which we are) to clone those repos, the strings would be git@github.com:GitHubOrgName/RepositoryName.git and git@github.com:GitHubUserName/RepositoryName.git, respectively.
-
-What does SSH vs HTTPS mean? Read https://docs.github.com/en/get-started/getting-started-with-git/about-remote-repositories[here] for more information. When cloning a repo using HTTPS, it will look something like:
-
-[source,bash]
-----
-git clone https://github.com/user/repo.git
-----
-
-When cloning a repo using SSH, it will look something like:
-
-[source,bash]
-----
-git clone git@github.com:user/repo.git
-----
-
-Both work fine, but I've had fewer issues with the latter, so that is what we will stick to for now.
-====
-
-[IMPORTANT]
-====
-Make sure to run the clone command in a bash cell in your `firstname-lastname-project07.ipynb` file.
-====
-
-[NOTE]
-====
-The result of cloning the repository will be a directory called `f2021-stat39000-project7` in your `$HOME` directory. Due to the nature of this project, your cloned repo may contain other students' code, if their code has been merged into the `main` branch -- cool!
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Let's test things out to make sure they are working the way we intended. First, we can see that there is a `pyproject.toml` file and a `poetry.lock` file. Let's use poetry to build our virtual environment to run and test our code.
-
-In a bash cell in your notebook, run the following:
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project7
-poetry install
-----
-
-[NOTE]
-====
-Recall that the `module unload` command is only needed due to the way we have things configured on Brown -- _typically_ it would be much more straightforward, and we would just run `poetry install`.
-====
-
-Great! Now, in the next bash cell, test out things by running the `runme.py` script.
-
-[source,ipython]
-----
-%%bash
-
-# unload the module
-module unload python/f2021-s2022-py3.9.6
-
-# give execute permissions to the runme.py script
-chmod 700 $HOME/f2021-stat39000-project7/scripts/runme.py
-
-# navigate to inside the project directory (this is needed because your notebook is in your $HOME directory)
-cd $HOME/f2021-stat39000-project7
-
-# run the runme.py script using our environment
-poetry run python scripts/runme.py
-----
-
-If all went well, you should see something **similar** to the following output.
-
-.Output
-----
-Pandas is here!: /home/kamstut/f2021-stat39000-project7/.venv/lib/python3.9/site-packages/pandas/__init__.py
-^^^^^^^
-If that doesnt start with something like "$HOME/f2021-stat39000-project7/.venv/..., you did something wrong
-IMDB data from: /depot/datamine/data/movies_and_tv/imdb.db
-8.2
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Okay, great! So far, so good.
-
-As a very important contributor to our new package, you will be adding a method to our `IMDB` class. This method should use the `aiosql` package to run a query (or more than one query) against the `imdb.db` database, and return some data or do something cool. As an alternative, your method could also do some sort of web scraping for IMDB. Your new method _must_ include a Google style docstring, and _must_ be non-trivial -- for example, a method that returns the rating of a title or the name of a title is too simple. Any valid effort will be awarded full credit.
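-
-If you have not written one before, here is a minimal sketch of what a Google style docstring looks like. The function itself is made up purely for illustration and is not part of the package.
-
-[source,python]
-----
-def mean_of(values: list) -> float:
-    """Return the arithmetic mean of a list of numbers.
-
-    Args:
-        values: The numbers to average, for example [8.2, 7.5].
-
-    Returns:
-        The mean of the given values.
-
-    Raises:
-        ValueError: If values is empty.
-    """
-    if not values:
-        raise ValueError("values must not be empty")
-    return sum(values) / len(values)
-----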
-
-[WARNING]
-====
-Before continuing, let's follow the https://guides.github.com/introduction/flow/[first step] of the GitHub flow, and create our own branch to work on and commit changes to. Create a new branch called `firstname-lastname` from the `main` branch. Once created, _check out_ the branch so that it is your active branch.
-====
-
-[WARNING]
-====
-Remember that the `git` commands should be run _inside_ the project folder, `$HOME/f2021-stat39000-project7`. Since our Jupyter notebook, `firstname-lastname-project07.ipynb`, is in the `$HOME` directory, we need to `cd` into the project directory before we can run the `git` commands, for **every** bash cell in our notebook (except for the bash cell where we are cloning the repository). To make it explicitly clear, every bash cell in your notebook that isn't cloning the repo should have:
-
-[source,bash]
-----
-cd $HOME/f2021-stat39000-project7
-----
-
-_Before_ you run the `git` commands.
-====
-
-Please take a look at the `get_rating` method in the `imdb.py` module for an example of a method.
-
-Please take a look at the `imdb_queries.sql` file, to see how a query is written using this package. https://nackjicholson.github.io/aiosql/defining-sql-queries.html[Here] is the official documentation for `aiosql`.
-
-[NOTE]
-====
-Note that since we will _just_ be reading from the database, you will want to limit yourself to queries that are "Select One" (ending in a "^"), or "Select Value" (ending in a "$"), or "No Operator" (ending in no symbol).
-====
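-
-To give you a feel for how this works, below is a minimal sketch of how `aiosql` turns named queries into callable Python functions. The SQL file path and the query name are assumptions for illustration (following the naming convention described below) -- in the repository, take a look at `imdb.py` to see how this is actually wired up.
-
-[source,python]
-----
-import sqlite3
-
-import aiosql
-
-# load every named query in the SQL file, using the sqlite3 driver adapter
-queries = aiosql.from_path("imdb_queries.sql", "sqlite3")
-
-conn = sqlite3.connect("/depot/datamine/data/movies_and_tv/imdb.db")
-
-# a query named "firstname_lastname_01$" (the "$" operator) returns a single value
-rating = queries.firstname_lastname_01(conn, title_id="tt5180504")
-print(rating)
-
-conn.close()
-----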
-
-Please take a look at `runme.py` to see how we used the `tdm_media` package.
-
-To make these additions to the package you will need to:
-
-. Modify the `imdb.py` module to add the new method.
-+
-[WARNING]
-====
-For simplicity, call your new method `firstname_lastname` in the `imdb.py` module, where you would replace `firstname` and `lastname` with your first and last name, respectively.
-====
-+
-[NOTE]
-====
-If you want to have examples of `title_id` values and `person_id` values, look no further than https://imdb.com! For example, let's say I want Peter Dinklage's person_id -- to get this, all I have to do is search for him on the IMDB website. I will be sent to a link similar to the following.
-
-https://www.imdb.com/name/nm0227759
-
-Here, you can see Peter Dinklage's person_id in the URL itself! It is "nm0227759".
-
-Same for title_ids -- simply search for the movie or tv show or tv show episode you are curious about, and the `title_id` will be right in the URL.
-====
-. Modify the `imdb_queries.sql` file to add any new queries you need in order to get your `firstname_lastname` method working.
-+
-[WARNING]
-====
-For simplicity, call your new queries `firstname_lastname_XX` in the `imdb_queries.sql` file, where you would replace `firstname` and `lastname` with your first and last name, respectively, and replace `XX` with a counter like `01`, `02`, etc.
-
-For example, if I had two queries my additions would look something like this:
-
-.imdb_queries.sql
-[source,sql]
-----
--- name: kevin_amstutz_01$
--- Get the rating of the movie/tv episode/short with the given id
-SELECT rating FROM ratings WHERE title_id = :title_id;
-
--- name: kevin_amstutz_02$
--- Get the rating of the movie/tv episode/short with the given id
-SELECT rating FROM ratings WHERE title_id = :title_id;
-----
-====
-+
-. Create a new script in the scripts directory called `firstname_lastname.py`.
-+
-[TIP]
-====
-The following is some boilerplate code for your `firstname_lastname.py` script.
-
-[source,python]
-----
-import sys
-from pathlib import Path
-sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-
-from tdm_media.imdb import IMDB
-import pandas as pd
-
-def main():
-
- dat = IMDB("/depot/datamine/data/movies_and_tv/imdb.db")
-
- # code to use your method here, for example:
- print(dat.get_rating("tt5180504"))
-
-if __name__ == '__main__':
- main()
-----
-====
-+
-. Finally, if your new method uses a library not already included in our environment, you will need to install it.
-+
-[TIP]
-====
-To add the library (if and only if it is needed):
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project7
-poetry add thedatamine
-----
-
-Replace "thedatamine" with the name of the package you need.
-====
-
-Great! Once you've made these modifications, in a bash cell, run your new script and see if the output is what you expect it to be!
-
-[source,ipython]
-----
-%%bash
-
-# unload the module
-module unload python/f2021-s2022-py3.9.6
-
-# give execute permissions to the firstname_lastname.py script
-chmod 700 $HOME/f2021-stat39000-project7/scripts/firstname_lastname.py
-
-# navigate to inside the project directory (this is needed because your notebook is in your $HOME directory)
-cd $HOME/f2021-stat39000-project7
-
-# run the firstname_lastname.py script using our environment
-poetry run python scripts/firstname_lastname.py
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Fantastic! We have implemented our new things, and we are ready to continue with the GitHub flow!
-
-In a bash cell, navigate to the root of the project directory, `$HOME/f2021-stat39000-project7`, and stage any new files you've created that you would like to commit.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/f2021-stat39000-project7
-git add .
-----
-
-Excellent! Now, _commit_ the new files and changes. Be sure to include a commit message that describes what you've done.
-
-[TIP]
-====
-Using `git commit` requires having a message with your commit! To add a message, simply use the `-m` flag. For example:
-
-[source,bash]
-----
-git commit -m "This is my fantastic new function."
-----
-====
-
-[NOTE]
-====
-Normally, you'd add and commit files and changes as you are writing the code. However, since this is all so new, we set this up so you just add and commit all at once.
-====
-
-The next step in the GitHub flow would be to open a pull request. First, before we do that, we have to _push_ the changes we've made locally, on Brown, to our _remote_ (GitHub). To do this, in a bash cell, run the following command:
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/f2021-stat39000-project7
-git push --set-upstream origin firstname-lastname
-----
-
-[IMPORTANT]
-====
-Replace firstname-lastname with your first and last name, respectively. It is the name of the branch you created in question (3).
-====
-
-Once run, if you navigate to your fork's GitHub page, https://github.com/YourUserName/f2021-stat39000-project7, you should be able to refresh the webpage and see your new branch in the dropdown menu for branches.
-
-image::figure07.webp[Looking at the branches, width=792, height=500, loading=lazy, title="Looking at the branches"]
-
-Awesome! Okay, now you are ready to open a pull request. A pull request needs to be opened in the browser. Navigate to the project page https://github.com/YourUserName/f2021-stat39000-project7, click on the "Pull requests" tab, then click on "New pull request".
-
-We want to create a pull request that merges your branch, `firstname-lastname`, into the `main` branch. Select your branch from the menu on the right side of the left arrow, and click "Create pull request".
-
-image::figure08.webp[Selecting what to merge, width=792, height=500, loading=lazy, title="Selecting what to merge"]
-
-image::figure09.webp[Screen when selected, width=792, height=500, loading=lazy, title="Screen when selected"]
-
-Enter the important information in the boxes. Describe what your function does, and why you want to merge it into the main branch. Once satisfied, in a comment box, write something like "@kevinamstutz Could you please review this?".
-
-image::figure10.webp[Filling out the pull request, width=792, height=500, loading=lazy, title="Filling out the pull request"]
-
-Click "Create pull request", and you should see a screen similar to the following.
-
-image::figure11.webp[Resulting screen, width=792, height=500, loading=lazy, title="Resulting screen"]
-
-Write back and forth with me at least once, and when you are good to go, I will write back and merge the PR.
-
-Take a screenshot of the final result, after the PR is merged.
-
-image::figure12.webp[Final result, width=792, height=500, loading=lazy, title="Final result"]
-
-[IMPORTANT]
-====
-If I do not respond back and merge fast enough, it is OK to take a screenshot of the non-merged pull request page -- you will receive full credit. Try to wait though! I'm usually pretty quick!
-====
-
-Upload the screenshot to your `$HOME` directory, and include it using a markdown cell.
-
-[TIP]
-====
-To include the image in a markdown cell, do the following. The following assumes your image is called `myimage.png` and is located in your `$HOME` directory. It also assumes your notebook is in the `$HOME` directory.
-
-[source,ipython]
-----
-![](./myimage.png)
-----
-
-Then, run the cell! Your image will appear.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-Okay, what files should you submit for this project? Please submit the following:
-
-- `firstname-lastname-project07.ipynb` (your notebook).
-- Your modified `imdb.py`, with your `firstname_lastname` method.
-- Your modified `imdb_queries.sql` file with your added query(s).
-- Your script, `firstname_lastname.py`, that uses your `firstname_lastname` method.
-====
-
-=== Question 5 (optional, 0 pts)
-
-You've now worked through the entire GitHub flow! That is really great! It definitely can take some time getting used to. If you have the time, and are feeling adventurous, an _excellent_ test of your skills would be to add something to this book! Clone this repository (git@github.com:TheDataMine/the-examples-book.git), add some content, and create a pull request!
-
-You can add a UNIX, R, Python, or SQL example, no problem! At some point in time, I'll review your addition and you will be an official contributor to the book! Why not?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project08.adoc
deleted file mode 100644
index 03d0bd994..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project08.adoc
+++ /dev/null
@@ -1,411 +0,0 @@
-= STAT 39000: Project 8 -- Fall 2021
-
-**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program or script) and program speed (how fast your code runs). This is often the right tradeoff, depending on your staffing and how much your software developers or data scientists earn. However, Python code does _not_ have the advantage of being compiled to machine code for a certain architecture (x86_64, ARM, Darwin, etc.) and easily shared. In Python, you need to learn how to use virtual environments (and git) to share your code.
-
-**Context:** This is the last project in a series of 3 projects that explores how to setup and use virtual environments, as well as some `git` basics. In addition, we will use this project as a transition to learning about APIs.
-
-**Scope:** Python, virtual environments, git, APIs
-
-.Learning Objectives
-****
-- Explain what a virtual environment is and why it is important.
-- Create, update, and use a virtual environment to run somebody else's Python code.
-- Use git to create a repository and commit changes to it.
-- Understand and utilize common `git` commands and workflows.
-- Understand and use the HTTP methods with the `requests` library.
-- Differentiate between graphql, REST APIs, and gRPC.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/whin/whin.db`
-
-== Questions
-
-[NOTE]
-====
-We are _so_ lucky to have great partners in the Wabash Heartland Innovation Network (WHIN)! They generously provide us with access to their API (https://data.whin.org/[here]) for educational purposes. You've most likely either used their API in a previous project, or you've worked with a sample of their data to solve some sort of data-driven problem.
-
-You can learn more about WHIN at https://whin.org/[here].
-
-In this project, we are providing you with our own version of the WHIN API, so you can take a look under the hood, modify things, and have a hands-on experience messing around with an API written in Python! Behind the scenes, our API is connecting to a sqlite database that contains a small sample of the rich data that WHIN provides.
-====
-
-=== Question 1
-
-In a https://thedatamine.github.io/the-examples-book/projects.html#p09-290[previous project], we used the `requests` library to build a CLI application that made calls to the WHIN API.
-
-Our focus in _this_ project will be to study the WHIN API (and other APIs), with the goal of learning about the components of an API in a hands-on manner.
-
-Before we _really_ dig in, it is well worth our time to do some reading. There is a _lot_ of information online about APIs. There are a _lot_ of opinions on proper API design.
-
-[NOTE]
-====
-At no point in time will we claim that the way we are going to design our API is the best way to do it. However, we will try and learn from some of the most successful commercial APIs, mainly, the https://stripe.com/docs/api[Stripe API].
-====
-
-First things first, let's clone our _homage_ to the WHIN API. To prevent confusion, we will refer to this as **our** API. Run the following in a bash cell.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME
-git clone git@github.com:TheDataMine/f2021-stat39000-project8.git
-----
-
-Then, install the Python dependencies for this project by running the following code in a new bash cell.
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project8
-poetry install
-----
-
-Finally, let's see if we can _run_ this API on Brown! To do this, we will _not_ be running the API via a bash cell in Jupyter Lab. Instead, we will pop open a terminal, and have it running in another tab.
-
-Create a file called `.env` (that's it, no extension -- just a file named `.env`) inside your `f2021-stat39000-project8` directory, with the following single line of content.
-
-----
-DATABASE_PATH=/depot/datamine/data/whin/whin.db
-----
-
-[NOTE]
-====
-A file starting with a period is a _hidden_ file. In UNIX-like systems, you need to add `-a` to the `ls` command to see hidden files.
-
-Give it a try:
-
-[source,bash]
-----
-ls -la $HOME/f2021-stat39000-project8
-----
-====
-
-Then, open a new terminal tab. Click on the blue "+" button in the top left corner of the Jupyter Lab interface.
-
-image::figure16.webp[Create new Terminal tab, width=792, height=500, loading=lazy, title="Create new Terminal tab"]
-
-Then, on your kernel selection screen, scroll down until you see the "Terminal" box. Select it to launch a fresh terminal on Brown.
-
-image::figure17.webp[Select Terminal, width=792, height=500, loading=lazy, title="Select Terminal"]
-
-The command to run the API is as follows.
-
-[source,bash]
-----
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load poetry/1.1.10
-cd $HOME/f2021-stat39000-project8
-poetry run uvicorn app.main:app --reload
-----
-
-Now, with that being said, it is not _quite_ so simple. We are running this API on Brown, a community cluster with _lots_ of other users, running _lots_ of other applications. By default, fastapi will run on local port 8000. What this means is that if you were on your personal computer, you could pop open a browser and navigate to `http://localhost:8000/` to see the API. The problem _here_ is you _each_ need to be running your API on your _own_ port -- and it is very likely port 8000 is already in use.
-
-So what are we going to do? Well, one option is to just choose a number, and run your API with _this_ command.
-
-[source,bash]
-----
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load poetry/1.1.10
-cd $HOME/f2021-stat39000-project8
-poetry run uvicorn app.main:app --reload --port XXXXX
-----
-
-Where XXXXX is a number generated using the command below. In a bash cell, run the following code.
-
-[source,bash]
-----
-port
-----
-
-.Output
-----
-21650 # your number may be different!
-----
-
-[IMPORTANT]
-====
-You _must_ run this in a bash cell. This bash script lives in the `/scratch/brown/kamstut/tdm/bin` directory, which is _automatically_ added to your `$PATH` in our Jupyter Lab environment.
-====
-
-Then, given your _available_ port number, run the following from your terminal tab.
-
-[source,bash]
-----
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load poetry/1.1.10
-cd $HOME/f2021-stat39000-project8
-poetry run uvicorn app.main:app --reload --port 21650 # if your port was 1111 you'd replace 21650 with 1111
-----
-
-[IMPORTANT]
-====
-Replace 21650 with the port number from your `port` command you ran earlier. Every time you see 21650 in this project, replace it with **your** port number.
-====
-
-Once successful, you should see text _similar_ to the following.
-
-----
-INFO: Will watch for changes in these directories: ['$HOME/f2021-stat39000-project8']
-INFO: Uvicorn running on http://127.0.0.1:21650 (Press CTRL+C to quit)
-INFO: Started reloader process [94978] using watchgod
-INFO: Started server process [94997]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-----
-
-Then, to _see_ the API, or the responses, _normally_ you could just navigate to http://localhost:21650, and enter the URLs there. By default, the browser will GET those responses. Since our compute environment is a little bit more complicated, we will limit ourselves to GET'ing our responses using the `requests` package.
-
-Run the following in a cell.
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:21650")
-print(response.json())
-----
-
-You should be presented with an _extremely_ boring result -- a simple "hello world". Yay! You are running an API and even made a GET request to that API using the `requests` package. While this may or may not seem too cool to you, it is pretty awesome! I _hope_ these next few projects will be fun for you!
-
-[NOTE]
-====
-Please send any feedback you may have to kamstut@purdue.edu/mdw@purdue.edu/datamine@purdue.edu. This is the _first_ time we are testing out these project ideas, so any feedback -- positive or negative -- is welcome! I've already made a lot of notes to make some of the earlier projects less time consuming. We ultimately want to make these projects fun, give you some exposure to cool techniques used in industry, and hopefully make you a better programmer/statistician/nurse/whathaveyou. With that being said, I have definitely missed the mark many times, and your feedback helps a lot.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Great! Now, you have **our** API running on Brown. Now it's time to learn about what the heck an API is. There are a _lot_ of different types of APIs. The most commonly used today are RESTful APIs (what we will be focusing on, probably the most popular), graphQL APIs, and gRPC APIs.
-
-https://www.redhat.com/architect/apis-soap-rest-graphql-grpc[This] is a decent article highlighting the various types of APIs (feel free to skip the antiquated SOAP). Summarize the 3 mentioned APIs (RESTful, gRPC, and graphQL) in 1-2 sentences, and write at least 1 pro and 1 con of each.
-
-As I mentioned before, it makes the most sense to focus on RESTful APIs at this point in time, however, gRPC and graphQL have some serious advantages that make them very popular in industry. It is likely you will run into some of these in your future work.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Since it is not so straightforward to pull up the _automatically_ generated, interactive API documentation, we've provided screenshots below.
-
-image::figure18.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure19.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure20.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure21.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure22.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure23.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure24.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-image::figure25.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"]
-
-Awesome! There are some pretty detailed docs that we incorporated.
-
-Let's make a _request_ to our API. Once we make a _request_ to our API, we will receive a _response_ back. The main components of a request are:
-
-- The _method_ (GET, POST, PUT, DELETE, etc.)
-- The _path_ (the URL path)
-- The _headers_ (the HTTP headers)
-- The _body_ (the data that is sent in the request)
-
-That's it!
-
-The only method we will talk about in this project is the GET method. If you want a list of methods, simply Google "HTTP methods" and you should find a list of all the methods.
-
-The GET method is the same method that browsers primarily utilize when they navigate to a website. They GET the website content.
-
-The _path_ is the portion of the URL that comes after the host. In our case, the path was `/docs/` to get the docs! The path highlights the resource we are trying to access.
-
-The _headers_ are sent with the request and can be used for a wide variety of things. For example, in the next question, we will use a header to authenticate with the _real_ WHIN API and make a request.
-
-Finally, the _body_ is the data that is sent with the request. In our case, we will not be sending any data with our request, instead, we will be receiving data in the body of our _response_.
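-
-Before we make a real request in the next code block, here is an annotated sketch of where each of those components goes when using the `requests` package. The `/stations/` endpoint is the one from **our** API that we are about to request, and the header shown is just an illustrative, optional one.
-
-[source,python]
-----
-import requests
-
-# method:  determined by the function you call (requests.get, requests.post, ...)
-# path:    the portion of the URL after the host, here "/stations/"
-# headers: passed as a dict via the headers= argument
-# body:    for methods like POST, passed via json= or data=; a plain GET sends no body
-response = requests.get(
-    "http://localhost:21650/stations/",
-    headers={"Accept": "application/json"},
-)
-print(response.status_code)
-----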
-
-To make a request to our API, we can use the `requests` package. Run the following in a Python cell.
-
-[source,python]
-----
-import requests
-
-response = requests.get('http://localhost:21650/stations/')
-----
-
-`response` will then contain your -- response! If you look over in your terminal tab, you will see that **our** API logged the request we made.
-
-The response will contain a status code. You can see a list of status codes, and what they mean https://developer.mozilla.org/en-US/docs/Web/HTTP/Status[here].
-
-To get the status code from your `response` variable, try the following.
-
-[source,python]
-----
-response.status_code
-----
-
-Run the following to get a list of the methods and attributes available to you with the response object.
-
-[source,python]
-----
-dir(response)
-----
-
-You can see a lot -- this is a useful "trick" in Python. Alternatively, as with most dunder methods, you could also run the following.
-
-[source,python]
-----
-response.__dir__()
-----
-
-This is the same as:
-
-[source,python]
-----
-dir(response)
-----
-
-Okay, great!
-
-You can get the headers like this:
-
-[source,python]
-----
-response.headers
-----
-
-You can get the pure text of the response like this:
-
-[source,python]
-----
-response.text
-----
-
-Finally, to get the JSON-formatted body of the response, you can use the `json` method, which will return a list of dicts containing the data!
-
-[source,python]
-----
-response.json() # `json()` is a method, not an attribute (like `.text`), so the parentheses are important.
-----
-
-As you _may_ have ascertained, the endpoint, `http://localhost:21650/stations/`, will return a list of station objects -- very cool!
-
-In another tab in your regular browser running on your local machine, navigate to the https://data.whin.org/data/current-conditions[official WHIN api docs] (you may need to login). Follow the directions at the beginning of https://thedatamine.github.io/the-examples-book/projects.html#p09-290[this project] to be able to authenticate with the WHIN API (questions 1 _and_ 2).
-
-Next, make sure you followed the instructions in question (2) from https://thedatamine.github.io/the-examples-book/projects.html#p09-290[this project] and that your `.env` file now contains something like:
-
-..env file
-----
-DATABASE_PATH=/depot/datamine/data/whin/whin.db
-MY_BEARER_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjp7ImlkIjo5LCJkaXsdgw3ret234gBbXN0dXR6IiwiYWNjb3VudF90eXBlIjoiZWR1Y2F0aW9uIn0sImlhdCI6MTYzNDgyMzUyOSwibmJmIjoxNjM0ODIzNTI5LCJleHAiOjE2NjYzNTk1MjksImlzcyI6Imh0dHBzOi8vd2hpbi5vcmcifQ.LASER2vFONRhkdrPtEwca0eGxCtbjJ4Btaurgerg7l27z_Rwqhy1gghdFpscLFkFzfVw7VUdV_hlJ1rzmHi8i75hcLEUL18T76kdY82yb7Q8b_YTB32iQnJDP3uVQP5sQWs5mv8HcEj6W7jNX5HQe-iItzBXVAcMBUmR0SK9Pt2JRmCbuHpM242JJqwBvEMZw1mjNWGs70c595QqyxaUtgrSSmMBbZQeaN21U9EuSEjUKBRgtjl-9t-IhLkLVNo008Vq4v-sA
-----
-
-If you are having a hard time adding another line to your `.env` file, you can also run the following in a bash cell to _append_ the line to your `.env` file. **Make sure you replace the token with _your_ token.**
-
-[source,bash]
-----
-echo "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjp7ImlkIjo5LCJkaXsdgw3ret234gBbXN0dXR6IiwiYWNjb3VudF90eXBlIjoiZWR1Y2F0aW9uIn0sImlhdCI6MTYzNDgyMzUyOSwibmJmIjoxNjM0ODIzNTI5LCJleHAiOjE2NjYzNTk1MjksImlzcyI6Imh0dHBzOi8vd2hpbi5vcmcifQ.LASER2vFONRhkdrPtEwca0eGxCtbjJ4Btaurgerg7l27z_Rwqhy1gghdFpscLFkFzfVw7VUdV_hlJ1rzmHi8i75hcLEUL18T76kdY82yb7Q8b_YTB32iQnJDP3uVQP5sQWs5mv8HcEj6W7jNX5HQe-iItzBXVAcMBUmR0SK9Pt2JRmCbuHpM242JJqwBvEMZw1mjNWGs70c595QqyxaUtgrSSmMBbZQeaN21U9EuSEjUKBRgtjl-9t-IhLkLVNo008Vq4v-sA" >> $HOME/f2021-stat39000-project08/.env
-----
-
-[IMPORTANT]
-====
-You must replace the token value with **your** token from https://data.whin.org/account[this page].
-====
-
-When configured, make the following request.
-
-[source,python]
-----
-import requests
-import os
-from dotenv import load_dotenv
-
-load_dotenv(os.getenv("HOME")+"/f2021-stat39000-project8/.env")
-
-my_headers = {"Authorization": f"Bearer {os.getenv('MY_BEARER_TOKEN')}"}
-response = requests.get("https://data.whin.org/api/weather/stations", headers = my_headers)
-print(response.json())
-----
-
-You'll find that the responses are very similar -- but of course, ours is just a sample of theirs.
-
-Notice that the response is pretty long, but it is a _list_ of dictionaries, so we can easily print the first 5 values only, like this.
-
-[source,python]
-----
-print(response.json()[:5])
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-You've successfully made a _request_ to both **our** API (which you are running in the terminal tab), and the WHIN API -- very cool!
-
-Read the documentation provided for **our** API in the screenshots in question (3), and make a request with a _query parameter_. A _query parameter_ is a parameter added to the end of the URL itself. The query string starts with a "?", contains key-value pairs joined by "=", and multiple pairs can be strung together using "&" to separate them. For example.
-
-----
-http://localhost:21650/some_endpoint?queryparam1key=queryparam1value&queryparam2key=queryparam2value
-----
-
-Here, we have 2 query parameters, `queryparam1key` and `queryparam2key`, and their values are `queryparam1value` and `queryparam2value`, respectively.
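-
-If you'd rather not build the query string by hand, the `requests` package can do it for you via the `params` argument. The parameter names below are just the placeholders from the example URL above -- check the docs screenshots for the parameters **our** API actually accepts.
-
-[source,python]
-----
-import requests
-
-# requests URL-encodes the dict and appends it as ?key=value&key=value
-params = {"queryparam1key": "queryparam1value", "queryparam2key": "queryparam2value"}
-response = requests.get("http://localhost:21650/some_endpoint", params=params)
-
-print(response.url)  # the full URL that was requested, query string included
-----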
-
-In **our** API, there are a few endpoints that give you optional query parameters (see the images in question (3)) -- use the `requests` library to test it out and make a request involving at least 1 query parameter with any of the endpoints we provide with **our** API.
-
-Now, try and replicate the request using the original WHIN API -- were you able to fully replicate it?
-
-[NOTE]
-====
-When we ask "were you able to fully replicate it", all we want to know is if the WHIN API happens to provide the same functionality.
-====
-
-The two APIs are pretty different, and provide different functionality. No two APIs are the same, and depending on the purpose of your API, you may build it differently! Very cool!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Make a new request to **our** API, and use at least 2 query parameters in your request -- do the results make sense based on what you've read in the docs? Why or why not?
-
-In websites, a common feature is _pagination_ -- the ability to page through lots of results, one page at a time. Often, this will look like a "Next" and "Previous" button on a webpage. Which of the query parameters would be useful for pagination in our API, and why?
-
-Finally, make a new request to the original WHIN API. Specifically, try and test out the very cool `current-conditions` endpoint that allows you to home in on stations near a certain latitude and longitude. Can you replicate this with our API, or do we not have that capability baked in?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project09.adoc
deleted file mode 100644
index 7b8dd3d13..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project09.adoc
+++ /dev/null
@@ -1,522 +0,0 @@
-= STAT 39000: Project 9 -- Fall 2021
-
-**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. We will address some pros and cons of each, with a focus on the most ubiquitous, RESTful APIs.
-
-**Context:** This is the second in a series of 4 projects focused around APIs. We will learn some basics about interacting and using APIs, and even build our own API.
-
-**Scope:** Python, APIs, requests, fastapi
-
-.Learning Objectives
-****
-- Understand and use the HTTP methods with the `requests` library.
-- Differentiate between graphql, REST APIs, and gRPC.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-== Questions
-
-=== Question 1
-
-Begin this project by cloning our repo and installing the required packages. To do so, run the following in a bash cell.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME
-git clone git@github.com:TheDataMine/f2021-stat39000-project9.git
-----
-
-Then, to install the required packages, run the following in a bash cell.
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project9
-poetry install
-----
-
-Next, let's identify a port that we can run our API on. In a bash cell, run the following.
-
-[source,ipython]
-----
-%%bash
-
-port
-----
-
-You will get a port number, like the following, for example.
-
-.Output
-----
-1728
-----
-
-From this point on, when we mention the port 1728, please replace it with the port number you were assigned. Open a new terminal tab so that we can run our API, alongside our notebook.
-
-Next, you'll need to add a `.env` file to your `f2021-stat39000-project9` directory, with the following content. (Pretty much just like the previous project!)
-
-----
-DATABASE_PATH=/depot/datamine/data/movies_and_tv/imdb.db
-----
-
-In **your terminal (not a bash cell)**, run the following.
-
-[source,bash]
-----
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load poetry/1.1.10
-cd $HOME/f2021-stat39000-project9
-poetry run uvicorn app.main:app --reload --port 1728
-----
-
-Upon success, you should see some output similar to:
-
-.Output
-----
-INFO: Will watch for changes in these directories: ['$HOME/f2021-stat39000-project9']
-INFO: Uvicorn running on http://127.0.0.1:1728 (Press CTRL+C to quit)
-INFO: Started reloader process [25005] using watchgod
-INFO: Started server process [25008]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-----
-
-Fantastic! Leave that running in your terminal, and test it out with the following request in a regular Python cell in your notebook.
-
-[CAUTION]
-====
-Make sure to replace 1728 with the port number you were assigned.
-====
-
-[source,python]
-----
-import requests
-resp = requests.get("http://localhost:1728")
-print(resp.json())
-----
-
-You should receive a Hello World message, great!
-
-[TIP]
-====
-Throughout this project, be patient waiting for your requests to complete -- sometimes they take a while. If it is taking too long, you can always try killing the server. To do so, open the terminal tab and hold ctrl and press c. This will kill the server. Once killed, just restart it using the same command you used previously to start it.
-
-Finally, there are now 2 places to check for errors and print statements: the terminal and the notebook. When you get an error, be sure to check both for useful clues! Keep in mind that you only need to modify 3 files: `main.py`, `queries.sql`, and `imdb.py` (plus making the requests in your notebook). Don't worry about any of the other files, but feel free to look around if you want!
-====
-
-[TIP]
-====
-Please test the requests in your notebook with the code we provide you. We've tested them and know that they work. If you choose to test them with a different movie/tv show/etc., you could get unexpected errors related to our `schemas.py` file -- best just to stick to the requests we provide.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Okay, so the goal of the next 4 or so questions is to put together the following API endpoints, which return simple JSON responses with the desired data. You can almost think of this as one big fancy interface to return data from our database in JSON format -- that _is_ pretty much what it is! BUT we have the capability to do nice data-processing on the data _before_ it is returned, which can be difficult using _just_ SQL.
-
-The following are a list of endpoints that we _already_ have implemented for you, to help get you started.
-
-- `http://localhost:1728/movies/{title_id}`
-
-[NOTE]
-====
-Here the `{title_id}` portion represents a _path parameter_. https://stackoverflow.com/questions/30967822/when-do-i-use-path-params-vs-query-params-in-a-restful-api[Here] is a good discussion on when you should choose to design your API with a path parameter vs. a query parameter. The top answer is really good.
-
-To be very clear, the following would be an example making a request to the `/movies/{title_id}` endpoint.
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/movies/tt0076759")
-print(response.json())
-----
-====
-
-The following are a list of endpoints we want _you_ to build!
-
-- `http://localhost:1728/cast/{title_id}`
-- `http://localhost:1728/tv/{title_id}`
-- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes/{episode_number}`
-- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes` (optional)
-
-The following are a list of endpoints that we will provide you in project 10.
-
-- `http://localhost:1728/tv/{title_id}/seasons`
-- `http://localhost:1728/tv/{title_id}/seasons/{season_number}`
-
-This will be a very guided project, so please be sure to read the instructions carefully, and as you are working, use your imagination to think about what other cool potential and possibilities building APIs can have! We are only scratching the surface here!
-
-Okay, let's get started with the first endpoint.
-
-- `http://localhost:1728/cast/{title_id}`
-
-Implement this endpoint. What files do you need to modify?
-
-- Add the following function to `main.py`
-+
-[source,python]
-----
-@app.get(
- "/cast/{title_id}",
- response_model=list[CrewMember],
- summary="Get the crew for a title_id.",
- response_description="A crew."
-)
-async def get_cast(title_id: str):
- cast = get_cast_for_title(title_id)
- return cast
-----
-+
-- Add the following query to `queries.sql`, filling in the query
-+
-----
--- name: get_cast_for_title
--- Get the cast for a given title
-SELECT statement here
-----
-+
-[IMPORTANT]
-====
-Make sure you don't add the caret "^" to the end of this particular query. Otherwise, it will only return 1 result.
-====
-+
-[TIP]
-====
-In your `queries.sql` file, anything starting with a colon is a placeholder for a variable you will pass along. Check out the `imdb.py` file and the `queries.sql` file to better understand.
-====
-+
-- In your `imdb.py` module, fill out the skeleton function called `get_cast_for_title`, which returns a list of `CrewMember` objects.
-+
-[TIP]
-====
-Here is the function you can finish writing:
-
-[source,python]
-----
-def get_cast_for_title(title_id: str) -> list[CrewMember]:
- # Get the cast for the movie, and close the database connection
- conn = sqlite3.connect(database_path)
- results = queries.get_cast_for_title(conn, title_id = title_id)
- conn.close()
-
- # Create a list of dictionaries, where each dictionary is a cast member
- # INITIALIZE EMPTY LIST
- for member in results:
- crewmemberobject = CrewMember(**{key: member[i] for i, key in enumerate(CrewMember.__fields__.keys())})
- # APPEND crewmemberobject TO LIST
-
- return cast
-----
-====
-+
-[TIP]
-====
-Check out the `get_movie_with_id` function for help! It should just be a few small modifications.
-====
-
-To test your endpoint, run the following in a Python cell in your notebook.
-
-[source,python]
-----
-import requests
-resp = requests.get("http://localhost:1728/cast/tt0076759")
-print(resp.json())
-----
-
-.Output
-----
-[{'title_id': 'tt0076759', 'person_id': 'nm0000027', 'category': 'actor', 'job': None, 'characters': '["Ben Obi-Wan Kenobi"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0000148', 'category': 'actor', 'job': None, 'characters': '["Han Solo"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0000184', 'category': 'director', 'job': None, 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0000402', 'category': 'actress', 'job': None, 'characters': '["Princess Leia Organa"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0000434', 'category': 'actor', 'job': None, 'characters': '["Luke Skywalker"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0002354', 'category': 'composer', 'job': None, 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0156816', 'category': 'editor', 'job': 'film editor', 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0476030', 'category': 'producer', 'job': 'producer', 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0564768', 'category': 'producer', 'job': 'producer', 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0852405', 'category': 'cinematographer', 'job': 'director of photography', 'characters': '\\N'}]
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Implement the following endpoint.
-
-- `http://localhost:1728/tv/{title_id}`
-
-For this question, we will leave it up to you to figure out what files to modify in what ways.
-
-[TIP]
-====
-Check out the functions that are already implemented for help -- it will be _very_ similar! If you get an error at any step of the way, _read_ the errors -- they tell you what is missing 90% of the time -- or at least hint at it!
-
-We've provided you with skeleton functions (with comments) in `imdb.py` that you can use to get started (just fill them in).
-====
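-
-To give you a sense of the _shape_ of the solution (a rough sketch only -- the function and helper names below are hypothetical, and your `imdb.py` skeleton may differ), the route in `main.py` will look something along these lines:
-
-[source,python]
------
-# Rough sketch, not the exact solution. `get_show_for_title` is a hypothetical
-# helper name, and the HTTPException/status code choice is an assumption.
-from fastapi import HTTPException
-
-@app.get("/tv/{title_id}", summary="Get a tv series for a title_id.")
-async def get_tv_show(title_id: str):
-    show = get_show_for_title(title_id)
-    if show.type != "tvSeries":
-        raise HTTPException(
-            status_code=400,
-            detail=f"Title with title_id '{title_id}' is not a tv series, it is a {show.type}."
-        )
-    return show
------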
-
-[NOTE]
-====
-One of the cool things that make APIs so useful is how easy it is to share data in a structured way with others! While there is typically a bit more setup to expose the API to the public -- it is really easy to share with other people on the same system. If you and your friend were on the same node, for example, `brown-a013`, your friend could make calls to your API too!
-====
-
-To test your endpoint, run the following in a Python cell in your notebook.
-
-[source,python]
-----
-import requests
-resp = requests.get("http://localhost:1728/tv/tt5180504")
-print(resp.json())
-----
-
-Which should return the following:
-
-.Output
-----
-{'title_id': 'tt5180504', 'type': 'tvSeries', 'primary_title': 'The Witcher', 'original_title': 'The Witcher', 'is_adult': False, 'premiered': 2019, 'ended': None, 'runtime_minutes': 60, 'genres': [{'genre': 'Action'}, {'genre': 'Adventure'}, {'genre': 'Fantasy'}]}
-----
-
-And also test with the following:
-
-[source,python]
-----
-import requests
-resp = requests.get("http://localhost:1728/tv/tt2953050")
-print(resp.json())
-----
-
-Which should return the following:
-
-.Output
-----
-{'detail': "Title with title_id 'tt2953050' is not a tv series, it is a movie."}
-----
-
-Similarly:
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt8343770")
-print(response.json())
-----
-
-Which should return the following:
-
-.Output
-----
-{'detail': "Title with title_id 'tt8343770' is not a tv series, it is a tvEpisode."}
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Implement the following endpoint.
-
-- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes/{episode_number}`
-
-Okay, don't be overwhelmed! There are only 3 files to modify and add code to: `main.py`, `queries.sql`, and `imdb.py`. Aside from that, you are just making `requests` library calls to test out the API (from within your notebook).
-
-[TIP]
-====
-We've provided you with the following queries (in `queries.sql`):
-
-----
--- name: get_title_type$
--- Get the type of title, movie, tvSeries, etc.
-
--- name: get_seasons_in_show$
--- Get the number of seasons in a show
-
--- name: get_episodes_in_season$
--- Get the number of episodes in a season for a given title with given title_id
-
--- name: get_episode_for_title_season_number_episode_number^
--- Get the episode title info for the title_id, season number and episode number
-----
-====
-
-[TIP]
-====
-- Use the `get_title_type` query to check if the type is not `tvSeries`.
-- Use the `get_seasons_in_show` query to check if the provided `season_number` is valid. For example it must be a positive number and less than or equal to the number of seasons actually in the given show.
-- Use the `get_episodes_in_season` query to check if the provided `episode_number` is valid. For example it must be a positive number and less than or equal to the number of episodes actually in the given season.
-====
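-
-A quick note on those trailing characters in the query names: with `aiosql`, the suffix controls what a call returns -- no suffix returns _all_ matching rows, `^` returns a single row, and `$` returns a single scalar value. Roughly (the keyword argument names below are assumptions based on the placeholders you write in `queries.sql`):
-
-[source,python]
------
-# Illustration only -- not the exact code you need to write.
-title_type = queries.get_title_type(conn, title_id="tt1475582")  # '$' -> one value, e.g. 'tvSeries'
-episode = queries.get_episode_for_title_season_number_episode_number(  # '^' -> a single row
-    conn, title_id="tt1475582", season_number=1, episode_number=2
-)
-cast = queries.get_cast_for_title(conn, title_id="tt0076759")  # no suffix -> a list of rows
------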
-
-All of these queries should be called in your `get_show_for_title_season_and_episode` function in `imdb.py`. We've provided you with skeleton code with comments to help -- just fill it in!
-
-Finally, you should make a `get_episode` function in `main.py`, with the following signature:
-
-[source,python]
-----
-async def get_episode(title_id: str, season_number: int, episode_number: int):
-----
-
-To test your endpoint, run the following in cells in your notebook.
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes/2")
-print(response.json())
-----
-
-.Output
-----
-{'title_id': 'tt1664529', 'type': 'tvEpisode', 'primary_title': 'The Blind Banker', 'original_title': 'The Blind Banker', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 89, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}
-----
-
-Also:
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1664529/seasons/1/episodes/2")
-print(response.json())
-----
-
-.Output
-----
-{'detail': "Title with title_id 'tt1664529' is not a tv series, it is a tvEpisode."}
-----
-
-Also:
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes/7")
-print(response.json())
-----
-
-And because there is no episode 7:
-
-.Output
-----
-{'detail': 'Season 1 only 4 episodes and you requested episode 7.'}
-----
-
-Also:
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1475582/seasons/5/episodes/7")
-print(response.json())
-----
-
-And because there is no season 5:
-
-.Output
-----
-{'detail': 'There are only 4 seasons for this show, you requested information about season 5.'}
-----
-
-[NOTE]
-====
-Note that this error takes precedence over the fact that there are only 4 episodes and we requested info for episode 7.
-====
-
-[WARNING]
-====
-For this project you should submit the following files:
-
-- `firstname-lastname-project09.ipynb` with output from making the requests to your API.
-- `main.py`
-- `queries.sql`
-- `imdb.py`
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5 (Optional, 0 pts)
-
-Implement the following endpoint.
-
-- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes`
-
-To test your endpoint, run the following in a Python cell in your notebook.
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes")
-print(response.json())
-----
-
-.Output
-----
-[{'title_id': 'tt1664529', 'type': 'tvEpisode', 'primary_title': 'The Blind Banker', 'original_title': 'The Blind Banker', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 89, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}, {'title_id': 'tt1664530', 'type': 'tvEpisode', 'primary_title': 'The Great Game', 'original_title': 'The Great Game', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 89, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}, {'title_id': 'tt1665071', 'type': 'tvEpisode', 'primary_title': 'A Study in Pink', 'original_title': 'A Study in Pink', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 88, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}, {'title_id': 'tt1815240', 'type': 'tvEpisode', 'primary_title': 'Unaired Pilot', 'original_title': 'Unaired Pilot', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 55, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}]
-----
-
-And of course, your endpoint should continue to return the same errors we've seen so far:
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1475582/seasons/5/episodes")
-print(response.json())
-----
-
-.Output
-----
-{'detail': 'There are only 4 seasons for this show, you requested information about season 5.'}
-----
-
-And
-
-[source,python]
-----
-import requests
-
-response = requests.get("http://localhost:1728/tv/tt1664529/seasons/5/episodes")
-print(response.json())
-----
-
-.Output
-----
-{'detail': "Title with title_id 'tt1664529' is not a tv series, it is a tvEpisode."}
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project10.adoc
deleted file mode 100644
index 44a023360..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project10.adoc
+++ /dev/null
@@ -1,350 +0,0 @@
-= STAT 39000: Project 10 -- Fall 2021
-
-**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. We will address some pros and cons of each, with a focus on the most ubiquitous, RESTful APIs.
-
-**Context:** This is the third in a series of 4 projects focused around APIs. We will learn some basics about interacting and using APIs, and even build our own API.
-
-**Scope:** Python, APIs, requests, fastapi
-
-.Learning Objectives
-****
-- Understand and use the HTTP methods with the `requests` library.
-- Differentiate between graphql, REST APIs, and gRPC.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-== Questions
-
-=== Question 1
-
-Begin this project by cloning our repo and installing the required packages. To do so, run the following in a bash cell.
-
-[IMPORTANT]
-====
-This repository -- TheDataMine/f2021-stat39000-project10 -- is a refreshed version of project (9). We've added some more functionality, but that is about it. Since it contains the solutions to project (9), it will be released sometime on Saturday, November 6th, and at the latest, on Monday, November 8th.
-
-Until that time, you are more than welcome to use the solutions to your project (9) as a starting point for this project.
-====
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME
-git clone git@github.com:TheDataMine/f2021-stat39000-project10.git
-----
-
-Then, to install the required packages, run the following in a bash cell.
-
-[source,ipython]
-----
-%%bash
-
-module unload python/f2021-s2022-py3.9.6
-cd $HOME/f2021-stat39000-project10
-poetry install
-----
-
-Next, let's identify a port that we can run our API on. In a bash cell, run the following.
-
-[source,ipython]
-----
-%%bash
-
-port
-----
-
-You will get a port number, like the following, for example.
-
-.Output
-----
-1728
-----
-
-From this point on, when we mention the port 1728, please replace it with the port number you were assigned. Open a new terminal tab so that we can run our API, alongside our notebook.
-
-Next, you'll need to add a `.env` file to your `f2021-stat39000-project10` directory, with the following content. (Pretty much just like the previous project!)
-
-----
-DATABASE_PATH=/depot/datamine/data/movies_and_tv/imdb.db
-----
-
-In your terminal, run the following.
-
-[source,bash]
-----
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load poetry/1.1.10
-cd $HOME/f2021-stat39000-project10
-poetry run uvicorn app.main:app --reload --port 1728
-----
-
-Upon success, you should see some output similar to:
-
-.Output
-----
-INFO: Will watch for changes in these directories: ['$HOME/f2021-stat39000-project10']
-INFO: Uvicorn running on http://127.0.0.1:1728 (Press CTRL+C to quit)
-INFO: Started reloader process [25005] using watchgod
-INFO: Started server process [25008]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-----
-
-Fantastic! Leave that running in your terminal, and test it out with the following request in a regular Python cell in your notebook.
-
-[source,python]
-----
-import requests
-my_headers = {'accept': 'application/json'}
-resp = requests.get("http://localhost:1728", headers=my_headers)
-print(resp.json())
-----
-
-You should receive a Hello World message, great!
-
-[TIP]
-====
-Throughout this project, be patient waiting for your requests to complete -- sometimes they take a while. If it is taking too long, you can always try killing the server. To do so, open the terminal tab and hold ctrl and press c. This will kill the server. Once killed, just restart it using the same command you used previously to start it.
-
-Finally, there are now 2 places to check for errors and print statements: the terminal and the notebook. When you get an error be sure to check both for useful clues!
-====
-
-[TIP]
-====
-Please test the requests in your notebook with the code we provide you. We've tested them and know that they work. If you choose to test them with a different movie/tv show/etc., you could get unexpected errors related to our `schemas.py` file -- best just to stick to the requests we provide.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-So you've written an API, now what? Well, while an API can have a variety of uses, one of the most common uses is as a _backend_ for a web application. Modern websites typically have a _frontend_ and _backend_. The frontend makes _requests_ to the backend, and the backend responds with _data_ to the frontend. The frontend then displays the data. This architecture makes it easy for developers to work independently on frontend things and backend things without having to understand every detail of the other "side" of the application.
-
-While frequently some sort of javascript framework is used for a frontend (things like reactjs, vuejs, angularjs, etc.), we can use Python and fastapi to create a super simple frontend!
-
-To get started, let's define something (just for clarity, these aren't real terms). Let's call a _backend_ request a request made with the `requests` package. This would be any request where we want the JSON formatted data as our response. Let's call a _frontend_ request a request made by a browser, or something similar. This would be any request where we want to use the data, but maybe display it using HTML, instead of JSON.
-
-The following is an example of a _backend_ request.
-
-[source,python]
-----
-import requests
-my_headers = {'accept': 'application/json'}
-resp = requests.get("http://localhost:1728", headers=my_headers)
-print(resp.json())
-----
-
-.Output
-----
-{'hello_item': 'hello', 'world_item': 'world'}
-----
-
-The following is an example of a _frontend_ request.
-
-[source,python]
-----
-import requests
-from IPython.core.display import display, HTML
-my_headers = {'accept': 'application/html'}
-resp = requests.get("http://localhost:1728", headers=my_headers)
-display(HTML(resp.text))
-----
-
-Where the output will be formatted HTML -- just like you'd see in a browser.
-
-[NOTE]
-====
-We _wanted_ you to be able to just type the URLs in a browser to see the results of our frontend requests, but unfortunately, this is the best we can do for now. We are emulating a frontend request by setting the accept header to `application/html`. This is a bit of a hack, but it works.
-====
-
-Okay, now, maybe you are asking yourself -- but the two requests have the same url, `http://localhost:1728`, why don't we get the same response for both?
-
-The answer is that we are using the `accept` header to try and determine if the request is being made from a browser, or from something like the `requests` package. Check out the `root` function in the `main.py` module.
-
-We first get the header from the `request` object:
-
-[source,python]
-----
-accept = request.headers.get("accept")
-----
-
-If the header is `application/json`, then we know that the user wants to have JSON output, not HTML. If the header is `application/html`, or if the header has multiple values separated by commas, then we assume that the user is a browser or someone making a frontend request.
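-
-In other words, the branching in `root` boils down to something like the following simplified sketch (the real function renders a template rather than returning a hard-coded string):
-
-[source,python]
------
-# Simplified sketch of the idea -- not the exact code in main.py.
-from fastapi import FastAPI, Request
-from fastapi.responses import HTMLResponse, JSONResponse
-
-app = FastAPI()
-
-@app.get("/")
-async def root(request: Request):
-    accept = request.headers.get("accept")
-    if accept == "application/json":
-        # "backend" request -- return raw JSON
-        return JSONResponse({"hello_item": "hello", "world_item": "world"})
-    # anything else (browsers send a comma-separated accept header) -- return HTML
-    return HTMLResponse("<h1>hello world</h1>")
------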
-
-Why is any of this important? Well, wouldn't it be cool if we could type `http://localhost:1728/movies/tt0076759` into a browser and get our data formatted into a webpage? But then, at the same time, use the exact same endpoint to get the data formatted as JSON, in case we wanted to use the API with some program we are writing? That's what this trick allows us to do!
-
-[IMPORTANT]
-====
-For this question, make sure to just run the "frontend" and "backend" requests in your notebook (provided above). Other than that, just try and do your best to understand what is happening in the `root` function. That's it!
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-The goal of this question (and the following questions) is to use our templating engine/Python package, `jinja2`, to render webpages for the requests we built in the previous project. To get you started, we've provided HTML templates in the `templates` directory. These templates currently just contain boilerplate HTML structure that you will add to so our data is rendered neatly(ish).
-
-[IMPORTANT]
-====
-At this point in time you are probably feeling overwhelmed and not understanding what is going on -- that is okay, it will start to make more sense as you mess around with things. If it is any consolation -- you will **not** be writing _any_ Python code today! You'll just be using the `jinja2` package within our HTML templates. There is a small learning curve, but I will provide examples with the questions, so you can see the syntax.
-====
-
-Let's start with the following webpage:
-
-- `http://localhost:1728/movies/{title_id}`
-
-To make the "frontend" request, run the following in a cell.
-
-[source,python]
-----
-import requests
-from IPython.core.display import display, HTML
-my_headers = {'accept': 'application/html'}
-resp = requests.get("http://localhost:1728/movies/tt0076759", headers=my_headers)
-display(HTML(resp.text))
-----
-
-We've set the template up to provide you with an example of a loop (see the genres section in `movie.html`), and some examples of simple data access. There are some missing pieces of information we want you to add (information in the "Facts:" section)! Please add the missing fields to the HTML template, and make a new frontend request. The results should look like the following:
-
-image::figure26.webp[Expected output for question 3, width=792, height=500, loading=lazy, title="Expected output for question 3"]
-
-To remind yourself what the JSON response for this request looks like run the following in a cell.
-
-[source,python]
-----
-import requests
-my_headers = {'accept': 'application/json'}
-resp = requests.get("http://localhost:1728/movies/tt0076759", headers=my_headers)
-print(resp.json())
-----
-
-We pass the entire `Movie` object to `jinja2`, so everything you see in the JSON response can be accessed and embedded in the HTML template. Notice in the `main.py` file how we are returning a single `Title` object. If you look in `schemas.py`, you can see all of the attributes of the `Title` object that you can access using dot notation. The variable itself is named `movie`, since the object we return in the `get_movies` function in `main.py` is named `movie`. So, in our template, we can access the primary title, for example, using `movie.primary_title`. We can also access any other variable that exists in the `Title` class shown in `schemas.py` in the same way!
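-
-If the syntax still feels abstract, here is a tiny, standalone illustration (not part of the project code -- the dictionary below is made up) of how `jinja2` dot notation and loops behave:
-
-[source,python]
------
-# Standalone illustration; the "movie" passed to render() here is a plain dict.
-from jinja2 import Template
-
-template = Template(
-    "<h1>{{ movie.primary_title }} ({{ movie.premiered }})</h1>"
-    "{% for g in movie.genres %}<p>{{ g.genre }}</p>{% endfor %}"
-)
-print(template.render(movie={
-    "primary_title": "Star Wars",
-    "premiered": 1977,
-    "genres": [{"genre": "Action"}, {"genre": "Adventure"}],
-}))
------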
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Let's say that we only like movies that premiered in 1990 or later. For any other movie, we want to make the `h1` header bright red for "not going to watch _that_". Could we do that? Yes!
-
-[TIP]
-====
-To change the text color of an `h1` element, see https://www.w3schools.com/html/html_styles.asp[this link].
-====
-
-Update the `movie.html` template to do this. Check out the examples https://jinja.palletsprojects.com/en/2.10.x/templates/#if[here].
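-
-As a hint at the `jinja2` side of this (just an illustration of the conditional syntax, not the exact `movie.html` changes):
-
-[source,python]
------
-# Illustration only -- run in a cell to see how a jinja2 conditional behaves.
-from jinja2 import Template
-
-t = Template(
-    "{% if movie.premiered >= 1990 %}"
-    "<h1>{{ movie.primary_title }}</h1>"
-    "{% else %}"
-    '<h1 style="color:red;">{{ movie.primary_title }}</h1>'
-    "{% endif %}"
-)
-print(t.render(movie={"primary_title": "Star Wars", "premiered": 1977}))
------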
-
-To test your work, run the following two chunks of code. The first should display in red, the second should not.
-
-[source,python]
-----
-import requests
-from IPython.core.display import display, HTML
-my_headers = {'accept': 'application/html'}
-resp = requests.get("http://localhost:1728/movies/tt0076759", headers=my_headers)
-display(HTML(resp.text))
-----
-
-[source,python]
-----
-import requests
-from IPython.core.display import display, HTML
-my_headers = {'accept': 'application/html'}
-resp = requests.get("http://localhost:1728/movies/tt7401588", headers=my_headers)
-display(HTML(resp.text))
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Okay, great! Now we have a cool page for any movie we want to look up. Read about HTML tables https://www.w3schools.com/html/html_tables.asp[here].
-
-Modify the `episodes.html` template in the `templates` directory to display the following information in a neatly formatted table _with_ a header row: `title_id`, `primary_title`, `is_adult`, `premiered`, and `runtime_minutes`.
-
-Rather than displaying `True` or `False` for the `is_adult` field, instead display the text `Yes` or `No`.
-
-[TIP]
-====
-Use conditionals in `jinja2` to display the text `Yes` or `No` for the `is_adult` field.
-====
-
-[TIP]
-====
-Check out the `get_episodes` function in `main.py` to see how we are returning a list of `Title` objects (that represent episodes). Note that the _name_ of the variable sent to the template is `episodes`, which is a _list_ of episodes. Use the name `episodes` in your template to access the data.
-====
-
-[TIP]
-====
-Remember, while working in your template, `episodes.html`, you can access the _list_ of `Title` objects using the name `episodes`. With that being said, **be careful** -- you don't want to try `episodes.primary_title` or `episodes.is_adult`, because `episodes` is a **list** of `Title` objects, not a single `Title` object, so those fields don't exist on the list itself.
-
-Therefore, you should use a loop to access each individual `Title` object in the `episodes` list.
-====
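-
-To make the loop idea concrete, here is a small standalone illustration (again, plain `jinja2` from Python with made-up data -- your actual changes go in `episodes.html`):
-
-[source,python]
------
-# Standalone illustration of looping over a list and turning a boolean into Yes/No.
-from jinja2 import Template
-
-t = Template(
-    "<table>"
-    "<tr><th>title_id</th><th>primary_title</th><th>is_adult</th></tr>"
-    "{% for ep in episodes %}"
-    "<tr><td>{{ ep.title_id }}</td><td>{{ ep.primary_title }}</td>"
-    "<td>{% if ep.is_adult %}Yes{% else %}No{% endif %}</td></tr>"
-    "{% endfor %}"
-    "</table>"
-)
-print(t.render(episodes=[
-    {"title_id": "tt1664529", "primary_title": "The Blind Banker", "is_adult": False},
-]))
------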
-
-To take a look at the list of `Title` objects returned by the `get_episodes` function, in JSON format, run the following in a cell.
-
-[source,python]
-----
-import requests
-my_headers = {'accept': 'application/json'}
-resp = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes", headers=my_headers)
-print(resp.json())
-----
-
-To test your work, run the following in a cell.
-
-[source,python]
-----
-import requests
-from IPython.core.display import display, HTML
-my_headers = {'accept': 'application/html'}
-resp = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes", headers=my_headers)
-display(HTML(resp.text))
-----
-
-The output should look like the following:
-
-image::figure27.webp[Expected results question 5, width=792, height=500, loading=lazy, title="Expected results question 5"]
-
-[WARNING]
-====
-For this project you should submit the following files:
-
-- `firstname-lastname-project10.ipynb` with output from making the requests to your API.
-- `movie.html`
-- `episodes.html`
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project11.adoc
deleted file mode 100644
index 25a89bdc7..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project11.adoc
+++ /dev/null
@@ -1,112 +0,0 @@
-= STAT 39000: Project 11 -- Fall 2021
-
-**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. We will address some pros and cons of each, with a focus on the most ubiquitous, RESTful APIs.
-
-**Context:** This is the fourth in a series of 4 projects focused around APIs. At this point in time there will be varying levels of understanding of APIs, how to use them, and how to write them. One of the "coolest" parts about APIs is how flexible they are. It is kind of like a website -- the limits are close to whatever you can imagine. Every once in a while we like to write projects that are open ended and allow you to do whatever you want within certain guidelines. This will be such a project.
-
-**Scope:** Python, APIs, requests, fastapi
-
-.Learning Objectives
-****
-- Understand and use the HTTP methods with the `requests` library.
-- Differentiate between graphql, REST APIs, and gRPC.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/**`
-
-You are free to use any dataset(s) you wish for this project. The only requirement is that there is _some_ data-oriented component to the API you build, and that there is a way for anyone (in the course) to access the data. So easily downloadable datasets, datasets in the data depot, web scraping, etc., are all acceptable.
-
-== Questions
-
-=== Overview
-
-At a high level, this project has 3 parts.
-
-. Write an API that does _something_ with data.
-. Provide a series of images, or a video/screen recording, that demonstrates what your API does.
-. If you chose to provide a series of images, provide text that explains what the images are showing and how your API behaves. If you chose to provide a video/screen recording, a verbal explanation can be used in lieu of text.
-
-
-If you choose to illustrate your API with **images**.
-
-.Items to submit
-====
-- A jupyter notebook with images and explanations of what you are showing in the images, and what your API is doing. Feel free to write about any struggles you ran into, how you fixed them (or why you couldn't figure out _how_ to fix them -- that is perfectly OK if that happens!).
-====
-
-If you choose to illustrate your API with **a video**.
-
-.Items to submit
-====
-- A video showing off your API and explaining what it does. Feel free to write about any struggles you ran into, how you fixed them (or why you couldn't figure out _how_ to fix them -- that is perfectly OK if that happens!).
-- If your video doesn't contain audio, include an `explain.txt` file with a written explanation of what your video is showing.
-====
-
-=== Part 1
-
-_Write an API that does **something** with data._
-
-**Requirements:**
-
-. The _something_ your API does must be non-trivial. In other words, don't _just_ regurgitate the data from the dataset. Wrangle the data, recombine it in a useful way, transform it into a graphic, summarize it, etc.
-. Just put 1 project's worth of _effort_ into your API. This will vary from student to student, but just show us some effort. We aren't looking for APIs that are perfect, or (anywhere near) as complicated as the previous projects you've worked on -- they can be _much_ simpler -- especially since you are putting it together from (basically) scratch!
-
-The open-ended nature of this project may frustrate some of you, so we will provide some ideas below that would be accepted for full credit.
-
-- Build on and add new features to an API from a previous project.
-- Use a feature of fastapi that you haven't seen before. For example, something like https://github.com/TheDataMine/fastapidemo[this] would be _more_ than enough. (Building on that demo is perfectly acceptable to do for this project too.) Other ideas could be using websockets (using fastapi), graphQL (using fastapi), a form that does something when you submit it, etc (these are all _way_ more than we expect from you).
-- Incorporate other skills you've learned previously (like scraping data, for instance) into your API.
-- You could write an API that scrapes the-examples-book.com and gives you the link to the newest 190/290/390 project (or something like that).
-- You could write a https://fastapi.tiangolo.com/tutorial/middleware/[middleware] that does something with the request and response for one of our previous APIs.
-- You could write an API that scrapes data from https://purdue.edu/directory and returns something.
-- You could write an API that returns a random "The Office" quote using a dataset in the data depot (this is an example of about the minimum we would expect from your API -- see the sketch just after this list).
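-
-For instance, a sketch of roughly that minimal random-quote idea could be as small as the following (the CSV path and column names here are hypothetical -- adapt them to whatever dataset you actually use):
-
-[source,python]
------
-# Rough sketch -- the path and column names are made up.
-import pandas as pd
-from fastapi import FastAPI
-
-app = FastAPI()
-quotes = pd.read_csv("/depot/datamine/data/the_office/quotes.csv")  # hypothetical path
-
-@app.get("/quote")
-async def random_quote():
-    row = quotes.sample(1).iloc[0]
-    return {"character": row["character"], "quote": row["quote"]}
------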
-
-Have fun, be creative, and know that we understand it is a stressful time and we will be lenient and forgiving with grading. This is about trying something new and maybe having some fun and incorporating your own interests into the project. _Please_ feel 100% free to use any of the previous projects as a starting point for your code -- we will _not_ consider that "copying" at all.
-
-=== Part 2
-
-_Provide a series of images, or a video/screen recording, that demonstrates what your API does._
-
-If you choose to use images. Submit a Jupyter notebook with images, followed by text, explaining what is in the images. As a reminder, you can insert an image using markdown, as follows.
-
-[source,ipython]
-----
-%%markdown
-
-![](/absolute/path/to/image.png)
-----
-
-Again, this doesn't need to be perfect, just add enough details so we can get a good idea of what you created.
-
-If you choose to do a screen recording, please add voiceover so you can explain what you are doing while you are doing it. Alternatively, feel free to have a silent video, but please also submit an `explain.txt` file with a written explanation of what your API does.
-
-The final requirement for **both** the video and image choices is to include a portion where you dig into a critical piece of your code and explain what it does. This is just so we see some of your code and can tell that you understand it.
-
-[TIP]
-====
-On a mac, an easy way to take a screen recording is to type kbd:[Ctrl + Cmd + 5], and then click on the record screen button option on the lower part of your screen. When you want to stop recording push the stop button in the menubar at the top of your screen where the date and time is shown.
-====
-
-[TIP]
-====
-On a windows machine, https://www.laptopmag.com/articles/how-to-video-screen-capture-windows-10[here] are some directions.
-====
-
-=== Part 3
-
-_If you chose to provide a series of images, provide text that explains what the images are showing and how your API behaves. If you chose to provide a video/screen recording, a verbal explanation can be used in lieu of text._
-
-This was explained in part (2); however, we are reiterating it here.
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project12.adoc
deleted file mode 100644
index 8f0fd73a3..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project12.adoc
+++ /dev/null
@@ -1,351 +0,0 @@
-= STAT 39000: Project 12 -- Fall 2021
-
-**Motivation:** Containers are a modern solution to packaging and shipping some sort of code in a reproducible and portable way. When dealing with R and Python code in industry, it is highly likely that you will eventually have a need to work with Docker, or some other container-based solution. It is best to learn the basics so the basic concepts aren't completely foreign to you.
-
-**Context:** This is the first project in a 2 project series where we learn about containers, and one of the most popular container-based solutions, Docker.
-
-**Scope:** Docker, unix, Python
-
-.Learning Objectives
-****
-- Understand the various components involved with containers: Dockerfile/build file, container image, container registry, etc.
-- Understand how to push and pull images to and from a container registry.
-- Understand the basic Dockerfile instructions.
-- Understand how to build a container image.
-- Understand how to run a container image.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-First things first. Please read https://www.padok.fr/en/blog/container-docker-oci?utm_source=pocket_mylist[this fantastic article] for a great introduction to containers. Afterwards, please review the content we have available xref:containers:index.adoc[here].
-
-In this project, we have a special challenge. Brown does _not_ have Docker installed. This is due to a variety of reasons. Brown _does_ have a tool called Singularity installed, however, it is different enough from more common containerization tools that it does not make sense to learn it for your first "container" experience.
-
-To solve this issue, we've created a virtual machine that runs Ubuntu, and has Docker pre-installed and configured for you to use. To be clear, the majority of this project will revolve around the command line from within Jupyter Lab. We will specifically state the "deliverables" which will mainly be text or images that are copied and pasted in Markdown cells.
-
-Please login and launch a Jupyter Lab session. Create a new notebook to put your solutions, and open up a terminal window beside your notebook.
-
-In your terminal, navigate to `/depot/datamine/apps/qemu/scripts/`. You should find 4 scripts. They perform the following operations, respectively.
-
-. Copies our VM image from `/depot/datamine/apps/qemu/images/` to `/scratch/brown/$USER/`, so you each get to work on your _own_ (virtual) machine.
-. Creates a SLURM job and provides you a shell to that job. The job will last 4 hours, provide you with 4 cores, and will have ~6GB of RAM.
-. Runs the virtual machine in the background, in your SLURM job.
-. SSH's into the virtual machine.
-
-Run the scripts in your Terminal, in order, from 1-4.
-
-[source,bash]
-----
-cd /depot/datamine/apps/qemu/scripts/
-./1_copy_vm.sh
-----
-
-[source,bash]
-----
-./2_grab_a_node.sh
-----
-
-[source,bash]
-----
-./3_run_a_vm.sh
-----
-
-[IMPORTANT]
-====
-You may need to press enter to free up the command line.
-====
-
-[source,bash]
-----
-./4_connect_to_vm.sh
-----
-
-[IMPORTANT]
-====
-You will eventually be asked for a password. Enter `thedatamine`.
-====
-
-[NOTE]
-====
-Remember, to add an image or screenshot to a markdown cell, you can use the following syntax:
-
-----
-![](/home/kamstut/my_image.png)
-----
-====
-
-.Items to submit
-====
-- A screenshot of your terminal window after running the 4 scripts.
-====
-
-=== Question 2
-
-Awesome! Your terminal is now connected to an instance of Ubuntu with Docker already installed and configured for you! Now, let's get to work.
-
-First things first. Let's test out _pulling_ an image from Docker Hub. `wernight/funbox` is a fun image for doing some wacky things on the command line. Pull the image (https://hub.docker.com/r/wernight/funbox), and verify that the image is available on your system using `docker images`.
-
-Run the following to get an ascii aquarium.
-
-[source,bash]
-----
-docker run -it wernight/funbox asciiquarium
-----
-
-Wow! That is wild! You can run this program on _any_ system where an OCI compliant runtime exists -- very cool!
-
-To quit the program, press kbd:[Ctrl + c].
-
-For this question, submit a screenshot of the running asciiquarium program.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Okay, that was fun, but let's do something a little bit more practical. Check out the `~/projects/whin` directory in your VM. You should pretty quickly realize that this is our version of the WHIN API that we used earlier on in project (8).
-
-If you recall, we had a lot of "extra" steps we had to take in order to run the API. We had to:
-
-- Install the Python dependencies.
-- Activate the appropriate Python environment.
-- Set the `DATABASE_PATH` environment variable.
-- Remember some long and complicated command.
-
-This is a fantastic example of when _containerizing_ your app could be a great idea!
-
-Let's begin by writing our own Dockerfile.
-
-First things first. We want our image to contain the correct version of Python for our app. Our app requires at least Python 3.9. Let's see if we can find a _base_ image that has Python 3.9 or later. Google "Python docker image" and you will find the following link: https://hub.docker.com/_/python
-
-Here, we will find a wide variety of different "official" Python docker images. A great place to start. If you click on the "Tags" tab, you will be able to scroll through a wide variety of different versions of Python + operating systems. A great Linux distribution is Debian.
-
-[NOTE]
-====
-Fun fact: Debian/the Debian project (one of the, if not _the_ most popular linux distribution) was founded by a Purdue alum, https://en.wikipedia.org/wiki/Ian_Murdock[Ian Murdock].
-====
-
-Okay, let's go for the Python 3.9.9 + Bullseye (Debian) image. The tag for the image is `python:3.9.9-bullseye`. But wait a second. If you look at the space required for the base image -- it is _already_ up to 370 or so MB -- that is quite a bit! Maybe there is a lighter weight option? If you search for "slim" you will find an image with the tag `python:3.9.9-slim-bullseye` that takes up only 45 MB by default -- much better.
-
-Create a file called `Dockerfile` in the `~/projects/whin` directory. Use vim/emacs/nano to edit the file to look like this:
-
-.Dockerfile
-----
-FROM python:3.9.9-slim-bullseye
-----
-
-Now, let's build our image.
-
-[source,bash]
-----
-docker build -t whin:0.0.1 .
-----
-
-Once created, you should be able to view your image by running the following.
-
-[source,bash]
-----
-docker images
-----
-
-Now, let's run our image. After running `docker images`, if you look under the `IMAGE ID` column, you should see an id for your image -- something like `3dk35bdl`. To run your image, do the following.
-
-[source,bash]
-----
-docker run -dit 3dk35bdl
-----
-
-Be sure to replace `3dk35bdl` with the id of your image. Great! Your image should now be running. Find out by running the following.
-
-[source,bash]
-----
-docker ps
-----
-
-Under the `NAMES` column, you will see the name of your running container -- very cool! But how does this test anything? Don't we want to see if we have Python 3.9 running like we want it to? Yes! Let's get a bash shell _inside_ our container. To do so run the following.
-
-[source,bash]
-----
-docker exec -it suspicious_lumiere /bin/bash
-----
-
-Replace `suspicious_lumiere` with the name of your container. You should now be in a bash shell. Awesome! Run the following to see what version of Python we have installed.
-
-[source,bash]
-----
-python --version
-----
-
-.Output
-----
-Python 3.9.9
-----
-
-Awesome! So far so good! To exit the container, type and run `exit`. Take a screenshot of your terminal after following these steps and add it to your notebook in a markdown cell.
-
-To clean up and stop the container, run the following.
-
-[source,bash]
-----
-docker stop suspicious_lumiere
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Okay, great! We have version 0.0.1 of our `whin` image.
-
-Now let's make this thing useful. Use vim/emacs/nano to edit the `~/projects/whin/Dockerfile` to look like this:
-
-.Dockerfile
-----
-FROM python:3.9.9-slim-bullseye
-
-WORKDIR /app
-
-RUN python -m pip install fastapi[all] pandas aiosql fastapi-responses cyksuid httpie
-
-COPY . .
-
-EXPOSE 21650
-
-CMD ["uvicorn", "app.main:app", "--reload", "--port", "21650", "--host", "0.0.0.0"]
-----
-
-Here, do your best to explain what each line of code does. Build version 0.0.2 of your image, and run it.
-
-Okay, in theory, that last line _should_ run our API -- awesome! Let's check the logs to see if it is working.
-
-[source,bash]
-----
-docker logs my_container_name
-----
-
-[TIP]
-====
-Remember, to get your container name, run `docker ps` and look under the `NAMES` column.
-====
-
-What you _should_ get is a Python error! Something about NoneType. Whoops! We forgot to include the `DATABASE_PATH` environment variable so our API knows where our WHIN database is. That is critical to our API.
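-
-If you are wondering where a `NoneType` error comes from: when an environment variable is not set, looking it up in Python quietly returns `None`, and things break once that `None` gets used as if it were a real path. Just to illustrate the general behavior (not the exact code in the app):
-
-[source,python]
------
-import os
-
-path = os.environ.get("DATABASE_PATH")  # None if the variable was never set
-print(path)  # None
-# Anything that then tries to use `path` as an actual path (e.g. opening the
-# database) fails with an error complaining about NoneType.
------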
-
-[TIP]
-====
-https://docs.docker.com/engine/reference/builder/#env[This Dockerfile instruction] will be very useful to achieve this!
-====
-
-Modify our Dockerfile to include the `DATABASE_PATH` environment variable with a value `/home/tdm-user/projects/whin/whin.db`. Rebuild your image (as version 0.0.2), and run it. Check the logs again, does it appear to be working?
-
-.Items to submit
-====
-- The fixed Dockerfile contents in a markdown cell as code (surrounded by 3 backticks).
-- A screenshot (or more) of the terminal output from running the various commands.
-====
-
-=== Question 5
-
-Okay, there is one step left. Let's see if the API is _really_ fully working by making a request to it. First, get a shell to the running container.
-
-[source,bash]
-----
-docker exec -it container_name /bin/bash
-----
-
-[TIP]
-====
-Remember, to get your `container_name`, list the running containers using `docker ps`.
-====
-
-Once inside the container, let's make a request to the API that is running. Run the following:
-
-[source,bash]
-----
-python -m httpie localhost:21650
-----
-
-If all is well you _should_ get:
-
-.Output
-----
-HTTP/1.1 200 OK
-content-length: 25
-content-type: application/json
-date: Thu, 18 Nov 2021 20:28:47 GMT
-server: uvicorn
-
-{
- "message": "Hello World"
-}
-----
-
-Awesome! You can see our API is definitely working, cool!
-
-Okay, one final test. Let's exit the container and make a request to the API again. After all, it wouldn't be that useful if we had to essentially login to a container when we want to access an API running _in_ that container, would it?
-
-[source,bash]
-----
-http localhost:21650
-----
-
-Uh oh! Although our API is running smoothly _inside_ of the container, we have no way of accessing it from _outside_ of the container. Remember, `EXPOSE` only _signals_ that we _want_ to expose that port -- it doesn't actually publish it for us. No worries, this is easily fixed.
-
-[source,bash]
-----
-docker run -dit -p 21650:21650 --name my_container_name 3kdgj024jn
-----
-
-[TIP]
-====
-Here, we named the resulting container `my_container_name`. This is a cool trick if you get tired of running `docker ps` to get the name of a newly running container.
-====
-
-Where `3kdgj024jn` is the id of your image. Now, let's try and access the API again.
-
-[source,bash]
-----
-http localhost:21650
-----
-
-Voila! It works! A shorter variation is the following -- note that when you give only the container port, Docker publishes it on a _random_ available host port (check `docker ps` to see which one), so it is similar but not strictly equivalent:
-
-[source,bash]
-----
-docker run -dit -p 21650 --name my_container_name 3kdgj024jn
-----
-
-However, if you want to specify that the API _internally_ uses port 21650, but expose the API running _inside_ the container on a different _host_ port, say port 5555, you could run the following.
-
-[source,bash]
-----
-docker run -dit -p 5555:21650 --name my_container_name 3kdgj024jn
-----
-
-Then, you could access the API by running the following:
-
-[source,bash]
-----
-http localhost:5555
-----
-
-While our request goes to port 5555, once the request hits the container, it is routed to port 21650 inside the container, which is where our API is running. This can be confusing and may take some experimentation until you are comfortable with it.
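-
-By the way, `http` here is just httpie -- the command line analog of the `requests` calls from earlier projects. The same check could be written in Python (a sketch; run it somewhere that can reach the published port, for example inside the VM):
-
-[source,python]
------
-# Equivalent check with the requests library against the published host port.
-import requests
-
-resp = requests.get("http://localhost:5555")
-print(resp.json())  # {'message': 'Hello World'}
------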
-
-.Items to submit
-====
-- Screenshot(s) showing the input and output from the terminal.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project13.adoc
deleted file mode 100644
index 3d6c69c42..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project13.adoc
+++ /dev/null
@@ -1,393 +0,0 @@
-= STAT 39000: Project 13 -- Fall 2021
-
-**Motivation:** Containers are a modern solution to packaging and shipping some sort of code in a reproducible and portable way. When dealing with R and Python code in industry, it is highly likely that you will eventually have a need to work with Docker, or some other container-based solution. It is best to learn the basics so the basic concepts aren’t completely foreign to you.
-
-**Context:** This is the second project in a 2 project series where we learn about containers.
-
-**Scope:** unix, Docker, Python, R, Singularity
-
-.Learning Objectives
-****
-- Understand the various components involved with containers: Dockerfile/build file, container image, container registry, etc.
-- Understand how to push and pull images to and from a container registry.
-- Understand the basic Dockerfile instructions.
-- Understand how to build a container image.
-- Understand how to run a container image.
-- Use singularity to run a container image.
-- State the primary differences between Docker and Singularity.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-Containers solve a real problem. In this project, we are going to demonstrate a real-world example of code that doesn't prove to be portable, and we will _fix_ it using containers.
-
-Check out the code (questions and solutions) in the https://thedatamine.github.io/the-examples-book/projects.html#p03-290[Fall 2020 STAT 29000 Project 3], and try to run the solution for question (4) in your Jupyter Notebook. You'll quickly notice that the code no longer works, _as-is_. In this case it is (partly) due to incorrect paths for the Firefox executable as well as the Geckodriver executable. These changes occurred because we switched systems from Scholar to Brown.
-
-_What if_ we could create a container to run this function on any system with an OCI compliant engine and/or runtime? Let's try!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Okay, below is a modified version of the code from the previous question. All we have done is turn it into a script that would be run as follows:
-
-[source,bash]
-----
-python get_price.py zip 47906
-----
-
-Okay, here it is:
-
-.get_price.py
-[source,python]
-----
-import sys
-import re
-import os
-import time
-import argparse
-
-from selenium import webdriver
-from selenium.webdriver.common.keys import Keys
-from selenium.webdriver.firefox.options import Options
-from selenium.common.exceptions import NoSuchElementException
-from selenium.webdriver.common.by import By
-from selenium.webdriver.firefox.service import Service
-
-def avg_house_cost(zip: str) -> float:
- firefox_options = Options()
- firefox_options.add_argument("window-size=1920,1080")
- firefox_options.add_argument("--headless") # Headless mode means no GUI
- firefox_options.add_argument("start-maximized")
- firefox_options.add_argument("disable-infobars")
- firefox_options.add_argument("--disable-extensions")
- firefox_options.add_argument("--no-sandbox")
- firefox_options.add_argument("--disable-dev-shm-usage")
- firefox_options.binary_location = '/class/datamine/apps/firefox/firefox'
-
- service = Service('/class/datamine/apps/geckodriver', log_path=os.path.devnull)
-
- driver = webdriver.Firefox(options=firefox_options, service=service)
- url = 'https://www.trulia.com/'
- driver.get(url)
-
- search_input = driver.find_element(By.ID, "banner-search")
- search_input.send_keys(zip)
- search_input.send_keys(Keys.RETURN)
- time.sleep(10)
-
- allbed_button = driver.find_element(By.XPATH, "//button[@data-testid='srp-xxl-bedrooms-filter-button']/ancestor::li")
- allbed_button.click()
- time.sleep(2)
-
- bed_button = driver.find_element(By.XPATH, "//button[contains(text(), '3+')]")
- bed_button.click()
- time.sleep(3)
-
- price_elements = driver.find_elements(By.XPATH, "(//ul[@data-testid='search-result-list-container'])[1]//div[@data-testid='property-price']")
- prices = [int(re.sub("[^0-9]", "", e.text)) for e in price_elements]
-
- driver.quit()
-
- return sum(prices)/len(prices)
-
-
-def main():
- parser = argparse.ArgumentParser()
-
- subparsers = parser.add_subparsers(help="possible commands", dest="command")
-
- zip_parser = subparsers.add_parser("zip", help="search by zipcode")
- zip_parser.add_argument("zip_code", help="the zip code to search for")
-
- if len(sys.argv) == 1:
- parser.print_help()
- parser.exit()
-
- args = parser.parse_args()
-
- if args.command == "zip":
- print(avg_house_cost(f'{args.zip_code}'))
-
-
-if __name__ == '__main__':
- main()
-----
-
-First things first: we need to launch and connect to our VM so we can create our Dockerfile and build our container image.
-
-If you have not already done so, please login and launch a Jupyter Lab session. Create a new notebook to put your solutions, and open up a terminal window beside your notebook.
-
-In your terminal, navigate to `/depot/datamine/apps/qemu/scripts/`. You should find 4 scripts. They perform the following operations, respectively.
-
-. Copies our VM image from `/depot/datamine/apps/qemu/images/` to `/scratch/brown/$USER/`, so you each get to work on your _own_ (virtual) machine.
-. Creates a SLURM job and provides you a shell to that job. The job will last 4 hours, provide you with 4 cores, and will have ~6GB of RAM.
-. Runs the virtual machine in the background, in your SLURM job.
-. SSH's into the virtual machine.
-
-Run the scripts in your Terminal, in order, from 1-4.
-
-[source,bash]
-----
-cd /depot/datamine/apps/qemu/scripts/
-./1_copy_vm.sh
-----
-
-[source,bash]
-----
-./2_grab_a_node.sh
-----
-
-[source,bash]
-----
-./3_run_a_vm.sh
-----
-
-[IMPORTANT]
-====
-You may need to press enter to free up the command line.
-====
-
-[source,bash]
-----
-./4_connect_to_vm.sh
-----
-
-[IMPORTANT]
-====
-You will eventually be asked for a password. Enter `thedatamine`.
-====
-
-[NOTE]
-====
-Remember, to add an image or screenshot to a markdown cell, you can use the following syntax:
-
-----
-![](/home/kamstut/my_image.png)
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Create a new folder in your $HOME directory (_inside_ your VM) called `project13`. Inside the folder, place the `get_price.py` code into a file called `get_price.py`. Give the file execute permissions:
-
-[source,bash]
-----
-chmod +x get_price.py
-----
-
-Great! Next, create a Dockerfile in the `project13` folder. The following is some _starter_ content for your Dockerfile.
-
-.Dockerfile
-----
-FROM python:3.9.9-slim-bullseye <1>
-
-RUN apt update && apt install -y wget bzip2 firefox-esr <2>
-
-<3>
-
-RUN wget --output-document=geckodriver.tar.gz https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux64.tar.gz && \
- tar -xvf geckodriver.tar.gz && \
- rm geckodriver.tar.gz && \
- chmod +x geckodriver <4>
-
-RUN python -m pip install selenium <5>
-
-<6>
-
-<7>
-
-<8>
-<9>
-----
-
-<1> The first line should look familiar. This is just our base image that has Python3 fully locked and loaded and ready for us to use.
-
-<2> The second line installs 3 critical packages in our container. The first is `wget`, which we use to download compatible versions of Geckodriver. The second is `bzip2`, which we use to unzip the Geckodriver archives. The third is firefox, which is installed to `/usr/bin/firefox`.
-
-<3> Here, I want you to change the work directory to `/vendor`, so our Geckodriver binary lives directly in `/vendor/geckodriver`.
-
-<4> The next line downloads the Geckodriver program, and extracts it.
-
-<5> This line installs the `selenium` Python package, which is needed for our `get_price.py` script.
-
-<6> Here, I want you to change the work directory to `/workspace` -- this way our `get_price.py` script will be copied in the `/workspace` directory.
-
-<7> Copy the `get_price.py` code into the `/workspace` directory.
-+
-[CAUTION]
-====
-You _may_ want to modify the script! There are two locations in the script: `/class/datamine/apps/firefox/firefox` as well as `/class/datamine/apps/geckodriver`. These _should_ be the location of the firefox executable and the geckodriver executable. Inside our container, however, these locations will be different! You will need to change the `/class/datamine/apps/firefox/firefox` to the location of the firefox executable, `/usr/bin/firefox`. You will need to change the `/class/datamine/apps/geckodriver` to the location of the geckodriver executable, `/vendor/geckodriver`.
-====
-+
-<8> Here, I want you to use the `ENTRYPOINT` command to place the commands that you _always_ want to run.
-+
-[TIP]
-====
-It will be 3 of the 4 parts of the following command (each part in quotes, in the proper list format):
-
-----
-python get_price.py zip 47906
-----
-====
-+
-<9> Here, I want you to use the `CMD` command to place a default zip code to search for. The `CMD` command will get overwritten by commands you enter in the terminal.
-+
-[TIP]
-====
-For example:
-
-----
-CMD ["47906"]
-----
-====
-
-The combination of (8) and (9) allows for the following functionality.
-
-[source,bash]
-----
-docker run ABC123XYZ
-----
-
-.Output
-----
-319876.0 # default price for 47906 (our default zip passed in (9))
-----
-
-Or, if you want to search for a zip code that is _not_ the default zip code (47906 in my example).
-
-[source,bash]
-----
-docker run ABC123XYZ 63026
-----
-
-.Output
-----
-498393.15 # price for 63026
-----
-
-Very cool!
-
-Okay, let's build your image.
-
-[source,bash]
-----
-docker build -t pricer:latest .
-----
-
-Upon success, you should be able to run the following to get the image id.
-
-[source,bash]
-----
-docker inspect pricer:latest --format '{{ .ID }}'
-----
-
-.Output
-----
-sha256:skjdbgf02u4ntb2j4tn
-----
-
-Then to test your image, run the following:
-
-[source,bash]
-----
-docker run skjdbgf02u4ntb2j4tn
-----
-
-[IMPORTANT]
-====
-Here, replace skjdbgf02u4ntb2j4tn with _your_ image id.
-====
-
-Then, to test a different, non-default zip code, run the following:
-
-[source,bash]
-----
-docker run skjdbgf02u4ntb2j4tn 63026
-----
-
-[IMPORTANT]
-====
-Make sure 63026 is a zip code that is different from your default zip code.
-====
-
-Awesome job! Okay, now, take some screenshots of all your hard work, and add them to your Jupyter Notebook in a markdown cell. Please also include your Dockerfile contents.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-[IMPORTANT]
-====
-You do _not_ need to complete the previous questions to complete this one.
-====
-
-All this talk about portability, and yet we've been working on the same VM the whole time. Well, let's use Singularity on Brown to run our code!
-
-[NOTE]
-====
-Singularity is a tool _similar_ to Docker, but different in many ways. The important thing to realize here is that since we have an OCI-compliant image publicly available, we can use Singularity to run our code. Otherwise, it is safe to just think of this as a different "docker" that works on Brown (for now).
-====
-
-The first step is to exit your VM if you have not already done so. Just run `exit`.
-
-Then, while on Brown, _pull_ our image. We've uploaded a correct version of the image for anyone to use. To pull the image using Singularity, run the following command.
-
-[source,bash]
-----
-cd $HOME
-singularity pull docker://kevinamstutz/pricer:latest
-----
-
-This may take a couple minutes to run. Once complete, you will see a SIF file in your $HOME directory called `pricer_latest.sif`. Think of this file as your container, but rather than accessing it using an engine (for example with `docker images`), you have a file.
-
-Then, to run the image, run the following command.
-
-[source,bash]
-----
-cd $HOME
-singularity run --cleanenv --pwd '/workspace/' pricer_latest.sif
-----
-
-[NOTE]
-====
-You may notice the extra argument `--cleanenv`. This prevents environment variables on Brown from leaking into our container. It is not obvious why this isn't the default behavior.
-
-In addition, the `WORKDIR` instruction is not respected by Singularity. This makes sense due to some core differences in design; however, it _does_ make it marginally more difficult to use images built with Docker, and as a result makes it less reliable to simply pull an image and run it. This is what the `--pwd '/workspace/'` argument is for. With that being said, if you don't already _know_ the directory from which the container expects to run, this can lead to more work.
-====
-
-Then, to give it a non-default zip code, run the following command.
-
-[source,bash]
-----
-singularity run --cleanenv --pwd '/workspace/' pricer_latest.sif 33004
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-projects.adoc
deleted file mode 100644
index 60d4d5861..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-projects.adoc
+++ /dev/null
@@ -1,59 +0,0 @@
-= STAT 39000
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-* xref:fall2021/39000/39000-f2021-officehours.adoc[STAT 39000 Office Hours for Fall 2021]
-* xref:fall2021/39000/39000-f2021-project01.adoc[Project 1: Review: Getting started with Jupyter Lab]
-* xref:fall2021/39000/39000-f2021-project02.adoc[Project 2: Python documentation: part I]
-* xref:fall2021/39000/39000-f2021-project03.adoc[Project 3: Python documentation: part II]
-* xref:fall2021/39000/39000-f2021-project04.adoc[Project 4: Testing in Python: part I]
-* xref:fall2021/39000/39000-f2021-project05.adoc[Project 5: Testing in Python: part II]
-* xref:fall2021/39000/39000-f2021-project06.adoc[Project 6: Virtual environments, git, & sharing Python code: part I]
-* xref:fall2021/39000/39000-f2021-project07.adoc[Project 7: Virtual environments, git, & sharing Python code: part II]
-* xref:fall2021/39000/39000-f2021-project08.adoc[Project 8: Virtual environments, git, & sharing Python code: part III & APIs: part I]
-* xref:fall2021/39000/39000-f2021-project09.adoc[Project 9: APIs: part II]
-* xref:fall2021/39000/39000-f2021-project10.adoc[Project 10: APIs: part III]
-* xref:fall2021/39000/39000-f2021-project11.adoc[Project 11: APIs: part IV]
-* xref:fall2021/39000/39000-f2021-project12.adoc[Project 12: Containerization: part I]
-* xref:fall2021/39000/39000-f2021-project13.adoc[Project 13: Containerization: part II]
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. After submitting your project in Gradescope, you will be able to download the project to verify that the content you submitted is what the graders will see. You will **not** get credit for or be able to re-submit your work if you accidentally uploaded the wrong project, or anything else. It is your responsibility to ensure that you are uploading the correct content.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2021/stat39000
-
-=== Link
-
-https://piazza.com/purdue/fall2021/stat39000/home
-
-== Syllabus
-
-++++
-include::book:ROOT:partial$syllabus.adoc[]
-++++
-
-== Office hour schedule
-
-++++
-include::book:ROOT:partial$office-hour-schedule.adoc[]
-++++
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/logistics/19000-f2021-officehours.adoc b/projects-appendix/modules/ROOT/pages/fall2021/logistics/19000-f2021-officehours.adoc
deleted file mode 100644
index 76ccbcd40..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/logistics/19000-f2021-officehours.adoc
+++ /dev/null
@@ -1,463 +0,0 @@
-= STAT 19000 Office Hours for Fall 2021
-
-It might be helpful to also have the office hours for STAT 29000 and STAT 39000:
-
-xref:logistics/29000-f2021-officehours.adoc[STAT 29000 Office Hours for Fall 2021]
-
-xref:logistics/39000-f2021-officehours.adoc[STAT 39000 Office Hours for Fall 2021]
-
-and it might be helpful to look at the
-xref:logistics/officehours.adoc[general office hours policies].
-
-The STAT 19000 office hours and WebEx addresses are the following:
-
-Webex addresses for TAs, Dr Ward, and Kevin Amstutz
-
-[cols="2,1,4"]
-|===
-|TA Name |Class |Webex chat room URL
-
-|Dr Ward (seminars)
-|all
-|https://purdue.webex.com/meet/mdw
-
-|Kevin Amstutz
-|all
-|https://purdue.webex.com/meet/kamstut
-
-|Melissa Cai Shi
-|19000
-|https://purdue.webex.com/meet/mcaishi
-
-|Shreyas Chickerur
-|19000
-|https://purdue-student.webex.com/meet/schicker
-
-|Nihar Chintamaneni
-|19000
-|https://purdue-student.webex.com/meet/chintamn
-
-|Sumeeth Guda
-|19000
-|https://purdue-student.webex.com/meet/sguda
-
-|Jonah Hu
-|19000
-|https://purdue-student.webex.com/meet/hu625
-
-|Darren Iyer
-|19000
-|https://purdue-student.webex.com/meet/iyerd
-
-|Pramey Kabra
-|19000
-|https://purdue-student.webex.com/meet/kabrap
-
-|Ishika Kamchetty
-|19000
-|https://purdue-student.webex.com/meet/ikamchet
-
-|Jackson Karshen
-|19000
-|https://purdue-student.webex.com/meet/jkarshe
-
-|Bhargavi Katuru
-|19000
-|https://purdue-student.webex.com/meet/bkaturu
-
-|Michael Kruse
-|19000
-|https://purdue-student.webex.com/meet/kruseml
-
-|Ankush Maheshwari
-|19000
-|https://purdue-student.webex.com/meet/mahesh20
-
-|Hyeong Park
-|19000
-|https://purdue-student.webex.com/meet/park1119
-
-|Vandana Prabhu
-|19000
-|https://purdue-student.webex.com/meet/prabhu11
-
-|Meenu Ramakrishnan
-|19000
-|https://purdue-student.webex.com/meet/ramakr20
-
-|Rthvik Raviprakash
-|19000
-|https://purdue-student.webex.com/meet/rravipra
-
-|Chintan Sawla
-|19000
-|https://purdue-student.webex.com/meet/csawla
-
-|Mridhula Srinivasa
-|19000
-|https://purdue-student.webex.com/meet/sriniv99
-
-|Tanya Uppal
-|19000
-|https://purdue-student.webex.com/meet/tuppal
-
-|Keerthana Vegesna
-|19000
-|https://purdue-student.webex.com/meet/vvegesna
-
-|Maddie Woodrow
-|19000
-|https://purdue-student.webex.com/meet/mwoodrow
-
-|Adrienne Zhang
-|19000
-|https://purdue-student.webex.com/meet/zhan4000
-|===
-
-[cols="1,1,1,1,1,1,1"]
-|===
-|Time (ET) |Sunday |Monday |Tuesday |Wednesday |Thursday |Friday
-
-|8:30 AM - 9:00 AM
-|
-.2+|Seminar: **Dr Ward**, Maddie Woodrow, Vandana Prabhu, Melissa Cai Shi, Jonah Hu, Mridhula Srinivasan, Michael Kruse
-|Chintan Sawla
-|Chintan Sawla
-|Ishika Kamchetty, Jackson Karshen
-|Chintan Sawla, Michael Kruse
-
-
-|9:00 AM - 9:30 AM
-|
-|Chintan Sawla
-|Chintan Sawla, Maddie Woodrow
-|Ishika Kamchetty, Jackson Karshen
-|Chintan Sawla, Michael Kruse
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|9:30 AM - 10:00 AM
-|
-.2+|Seminar: **Dr Ward**, Rthvik Raviprakash, Jonah Hu, Bhargavi Katuru, Sumeeth Guda (last half)
-|Chintan Sawla
-|Chintan Sawla, Maddie Woodrow
-|Ishika Kamchetty, Jackson Karshen
-|Chintan Sawla, Nihar Chintamaneni
-
-|10:00 AM - 10:30 AM
-|
-|Shreyas Chickerur
-|Maddie Woodrow
-|Mridhula Srinivasan, Ishika Kamchetty
-|Maddie Woodrow, Nihar Chintamaneni
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|10:30 AM - 11:00 AM
-|
-.2+|Seminar: **Dr Ward**, Michael Kruse, Sumeeth Guda, Ishika Kamchetty, Rthvik Raviprakash, Pramey Kabra, Bhargavi Katuru
-|Shreyas Chickerur
-|Michael Kruse, Maddie Woodrow
-|Mridhula Srinivasan, Ishika Kamchetty
-|Maddie Woodrow, Nihar Chintamaneni
-
-|11:00 AM - 11:30 AM
-|
-|Shreyas Chickerur
-|Shreyas Chickerur, Michael Kruse
-|Mridhula Srinivasan, Ishika Kamchetty
-|Ankush Maheshwari, Nihar Chintamaneni
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|11:30 AM - 12:00 PM
-|
-|Shreyas Chickerur
-|
-|Shreyas Chickerur, Michael Kruse
-|Mridhula Srinivasan, Ishika Kamchetty
-|Ankush Maheshwari, Nihar Chintamaneni
-
-|12:00 PM - 12:30 PM
-|
-|Shreyas Chickerur
-|Ishika Kamchetty
-|Shreyas Chickerur, Michael Kruse
-|Mridhula Srinivasan, Ishika Kamchetty
-|Ankush Maheshwari, Nihar Chintamaneni
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|12:30 PM - 1:00 PM
-|
-|Shreyas Chickerur
-|Ishika Kamchetty, Melissa Cai Shi
-|Shreyas Chickerur, Tanya Uppal
-|Rthvik Raviprakash, Sumeeth Guda
-|Vandana Prabhu, Ankush Maheshwari
-
-|1:00 PM - 1:30 PM
-|
-|Shreyas Chickerur
-|Melissa Cai Shi
-|Shreyas Chickerur, Tanya Uppal
-|Rthvik Raviprakash, Sumeeth Guda
-|Vandana Prabhu, Maddie Woodrow
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|1:30 PM - 2:00 PM
-|
-|Nihar Chintamaneni
-|Melissa Cai Shi
-|Tanya Uppal
-|Rthvik Raviprakash, Sumeeth Guda
-|Vandana Prabhu, Maddie Woodrow
-
-|2:00 PM - 2:30 PM
-|
-|Nihar Chintamaneni
-|Mridhula Srinivasan
-|
-|Rthvik Raviprakash, Pramey Kabra
-|Jonah Hu, Maddie Woodrow
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|2:30 PM - 3:00 PM
-|
-|Nihar Chintamaneni
-|Mridhula Srinivasan
-|Jonah Hu
-|Rthvik Raviprakash, Pramey Kabra
-|Jonah Hu, Maddie Woodrow
-
-|3:00 PM - 3:30 PM
-|
-|Nihar Chintamaneni
-|
-|Jonah Hu
-|Hyeong Park, Pramey Kabra, Keerthana Vegesna
-|Jonah Hu, Sumeeth Guda
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|3:30 PM - 4:00 PM
-|
-|Melissa Cai Shi
-|Adrienne Zhang
-|
-|Hyeong Park, Keerthana Vegesna
-|Jonah Hu, Sumeeth Guda
-
-|4:00 PM - 4:30 PM
-|
-|Melissa Cai Shi
-|Adrienne Zhang
-|Mridhula Srinivasan, Bhargavi Katuru (online)
-|Hyeong Park
-|Jonah Hu, Sumeeth Guda
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|4:30 PM - 5:00 PM
-|
-.2+|Seminar: **Dr Ward**, Tanya Uppal, Jackson Karshen, Keerthana Vegesna, Bhargavi Katuru
-|Adrienne Zhang
-|Mridhula Srinivasan, Bhargavi Katuru (online)
-|Hyeong Park, Pramey Kabra
-|Jonah Hu, Sumeeth Guda
-
-|5:00 PM - 5:30 PM
-|
-|Adrienne Zhang
-|Mridhula Srinivasan, Bhargavi Katuru (online)
-|Hyeong Park, Pramey Kabra
-|Tanya Uppal
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|5:30 PM - 6:00 PM
-|
-|
-|Adrienne Zhang
-|Jackson Karshen
-|Hyeong Park, Pramey Kabra
-|Tanya Uppal, Bhargavi Katuru (online)
-
-|6:00 PM - 6:30 PM
-|
-|
-|Tanya Uppal
-|Michael Kruse
-|Jackson Karshen, Rthvik Raviprakash
-|Bhargavi Katuru, Meenu Ramakrishnan
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|6:30 PM - 7:00 PM
-|
-|Keerthana Vegesna
-|Tanya Uppal
-|Michael Kruse
-|Jackson Karshen, Rthvik Raviprakash
-|Bhargavi Katuru, Meenu Ramakrishnan
-
-|7:00 PM - 7:30 PM
-|
-|Keerthana Vegesna
-|Tanya Uppal
-|Vandana Prabhu
-|Jackson Karshen, Rthvik Raviprakash, Ankush Maheshwari
-|Vandana Prabhu, Meenu Ramakrishnan
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|7:30 PM - 8:00 PM
-|
-|Keerthana Vegesna
-|Adrienne Zhang
-|Vandana Prabhu, Keerthana Vegesna
-|Jackson Karshen, Meenu Ramakrishnan, Ankush Maheshwari
-|Vandana Prabhu, Meenu Ramakrishnan
-
-|8:00 PM - 8:30 PM
-|
-|Hyeong Park
-|Adrienne Zhang
-|Chintan Sawla, Keerthana Vegesna
-|Jackson Karshen, Meenu Ramakrishnan, Ankush Maheshwari
-|Vandana Prabhu, Meenu Ramakrishnan
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|8:30 PM - 9:00 PM
-|
-|Hyeong Park
-|Adrienne Zhang
-|Chintan Sawla, Keerthana Vegesna
-|Jackson Karshen, Meenu Ramakrishnan, Ankush Maheshwari
-|Meenu Ramakrishnan, Nihar Chintamaneni
-
-|9:00 PM - 9:30 PM
-|
-|Hyeong Park
-|Adrienne Zhang
-|Pramey Kabra, Chintan Sawla, Keerthana Vegesna
-|Ankush Maheshwari, Meenu Ramakrishnan
-|Meenu Ramakrishnan, Nihar Chintamaneni
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|9:30 PM - 10:00 PM
-|
-|Hyeong Park
-|Adrienne Zhang
-|Pramey Kabra, Keerthana Vegesna
-|Ankush Maheshwari, Sumeeth Guda
-|Melissa Cai Shi, Meenu Ramakrishnan
-
-|10:00 PM - 10:30 PM
-|
-|Hyeong Park
-|Adrienne Zhang
-|Pramey Kabra
-|Ankush Maheshwari, Sumeeth Guda
-|Melissa Cai Shi
-
-|**Time (ET)**
-|**Sunday**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|10:30 PM - 11:00 PM
-|
-|Hyeong Park
-|Adrienne Zhang
-|Pramey Kabra
-|Ankush Maheshwari
-|Melissa Cai Shi
-|===
-
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/logistics/29000-f2021-officehours.adoc b/projects-appendix/modules/ROOT/pages/fall2021/logistics/29000-f2021-officehours.adoc
deleted file mode 100644
index c50389dfc..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/logistics/29000-f2021-officehours.adoc
+++ /dev/null
@@ -1,321 +0,0 @@
-= STAT 29000 and 39000 Office Hours for Fall 2021
-
-It might be helpful to also have the office hours for STAT 19000:
-
-xref:logistics/19000-f2021-officehours.adoc[STAT 19000 Office Hours for Fall 2021]
-
-and it might be helpful to look at the
-xref:logistics/officehours.adoc[general office hours policies].
-
-The STAT 29000 and 39000 office hours and WebEx addresses are the following:
-
-Webex addresses for TAs, Dr Ward, and Kevin Amstutz
-
-[cols="2,1,4"]
-|===
-|TA Name |Class |Webex chat room URL
-
-|Dr Ward (seminars)
-|all
-|https://purdue.webex.com/meet/mdw
-
-|Kevin Amstutz
-|all
-|https://purdue.webex.com/meet/kamstut
-
-|Jacob Bagadiong
-|29000
-|https://purdue-student.webex.com/meet/jbagadio
-
-|Darren Iyer
-|29000
-|https://purdue-student.webex.com/meet/iyerd
-
-|Rishabh Rajesh
-|29000
-|https://purdue-student.webex.com/meet/rajeshr
-
-|Haozhe Zhou
-|29000
-|https://purdue-student.webex.com/meet/zhou929
-
-|Nikhil D'Souza
-|39000
-|https://purdue-student.webex.com/meet/dsouza13
-|===
-
-[cols="1,1,1,1,1,1"]
-|===
-|Time (ET) |Monday |Tuesday |Wednesday |Thursday |Friday
-
-|8:00 AM - 8:30 AM
-|
-|
-|
-|Jacob Bagadiong
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|8:30 AM - 9:00 AM
-.2+|Seminar: **Dr Ward**, Haozhe Zhou
-|
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-
-|9:00 AM - 9:30 AM
-|
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|9:30 AM - 10:00 AM
-.2+|Seminar: **Dr Ward**, Haozhe Zhou, Nikhil D'Souza
-|
-|
-|Jacob Bagadiong
-|Haozhe Zhou
-
-|10:00 AM - 10:30 AM
-|
-|Darren Iyer
-|
-|Haozhe Zhou
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|10:30 AM - 11:00 AM
-.2+|Seminar: **Dr Ward**, Haozhe Zhou
-|
-|Darren Iyer
-|
-|Haozhe Zhou
-
-|11:00 AM - 11:30 AM
-|
-|Darren Iyer
-|
-|Haozhe Zhou
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|11:30 AM - 12:00 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|12:00 PM - 12:30 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|12:30 PM - 1:00 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|1:00 PM - 1:30 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|1:30 PM - 2:00 PM
-|
-|
-|
-|
-|
-
-|2:00 PM - 2:30 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|Darren Iyer
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|2:30 PM - 3:00 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|Darren Iyer
-|
-
-|3:00 PM - 3:30 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|3:30 PM - 4:00 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|4:00 PM - 4:30 PM
-|
-|
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|4:30 PM - 5:00 PM
-.2+|Seminar: **Dr Ward**, Jacob Bagadiong
-|
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|5:00 PM - 5:30 PM
-|
-|
-|
-|Darren Iyer
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|5:30 PM - 6:00 PM
-||
-|
-|
-|Darren Iyer
-
-
-|6:00 PM - 6:30 PM
-|Nikhil D'Souza
-|Nikhil D'Souza
-|Jacob Bagadiong
-|
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|6:30 PM - 7:00 PM
-|Nikhil D'Souza
-|Nikhil D'Souza
-|Jacob Bagadiong
-|
-|
-
-|7:00 PM - 7:30 PM
-|Nikhil D'Souza
-|Nikhil D'Souza
-|
-|Rishabh Rajesh
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|7:30 PM - 8:00 PM
-|
-|
-|
-|Rishabh Rajesh
-|
-
-|8:00 PM - 8:30 PM
-|
-|
-|
-|Rishabh Rajesh
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|8:30 PM - 9:00 PM
-|
-|
-|
-|Rishabh Rajesh
-|
-|===
-
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/logistics/39000-f2021-officehours.adoc b/projects-appendix/modules/ROOT/pages/fall2021/logistics/39000-f2021-officehours.adoc
deleted file mode 100644
index c50389dfc..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/logistics/39000-f2021-officehours.adoc
+++ /dev/null
@@ -1,321 +0,0 @@
-= STAT 29000 and 39000 Office Hours for Fall 2021
-
-It might be helpful to also have the office hours for STAT 19000:
-
-xref:logistics/19000-f2021-officehours.adoc[STAT 19000 Office Hours for Fall 2021]
-
-and it might be helpful to look at the
-xref:logistics/officehours.adoc[general office hours policies].
-
-The STAT 29000 and 39000 office hours and WebEx addresses are the following:
-
-Webex addresses for TAs, Dr Ward, and Kevin Amstutz
-
-[cols="2,1,4"]
-|===
-|TA Name |Class |Webex chat room URL
-
-|Dr Ward (seminars)
-|all
-|https://purdue.webex.com/meet/mdw
-
-|Kevin Amstutz
-|all
-|https://purdue.webex.com/meet/kamstut
-
-|Jacob Bagadiong
-|29000
-|https://purdue-student.webex.com/meet/jbagadio
-
-|Darren Iyer
-|29000
-|https://purdue-student.webex.com/meet/iyerd
-
-|Rishabh Rajesh
-|29000
-|https://purdue-student.webex.com/meet/rajeshr
-
-|Haozhe Zhou
-|29000
-|https://purdue-student.webex.com/meet/zhou929
-
-|Nikhil D'Souza
-|39000
-|https://purdue-student.webex.com/meet/dsouza13
-|===
-
-[cols="1,1,1,1,1,1"]
-|===
-|Time (ET) |Monday |Tuesday |Wednesday |Thursday |Friday
-
-|8:00 AM - 8:30 AM
-|
-|
-|
-|Jacob Bagadiong
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|8:30 AM - 9:00 AM
-.2+|Seminar: **Dr Ward**, Haozhe Zhou
-|
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-
-|9:00 AM - 9:30 AM
-|
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|9:30 AM - 10:00 AM
-.2+|Seminar: **Dr Ward**, Haozhe Zhou, Nikhil D'Souza
-|
-|
-|Jacob Bagadiong
-|Haozhe Zhou
-
-|10:00 AM - 10:30 AM
-|
-|Darren Iyer
-|
-|Haozhe Zhou
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|10:30 AM - 11:00 AM
-.2+|Seminar: **Dr Ward**, Haozhe Zhou
-|
-|Darren Iyer
-|
-|Haozhe Zhou
-
-|11:00 AM - 11:30 AM
-|
-|Darren Iyer
-|
-|Haozhe Zhou
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|11:30 AM - 12:00 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|12:00 PM - 12:30 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|12:30 PM - 1:00 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|1:00 PM - 1:30 PM
-|
-|
-|
-|
-|Nikhil D'Souza (WebEx)
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|1:30 PM - 2:00 PM
-|
-|
-|
-|
-|
-
-|2:00 PM - 2:30 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|Darren Iyer
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|2:30 PM - 3:00 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|Darren Iyer
-|
-
-|3:00 PM - 3:30 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|3:30 PM - 4:00 PM
-|
-|Jacob Bagadiong
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|4:00 PM - 4:30 PM
-|
-|
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|4:30 PM - 5:00 PM
-.2+|Seminar: **Dr Ward**, Jacob Bagadiong
-|
-|Rishabh Rajesh
-|
-|Darren Iyer
-
-|5:00 PM - 5:30 PM
-|
-|
-|
-|Darren Iyer
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|5:30 PM - 6:00 PM
-||
-|
-|
-|Darren Iyer
-
-
-|6:00 PM - 6:30 PM
-|Nikhil D'Souza
-|Nikhil D'Souza
-|Jacob Bagadiong
-|
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|6:30 PM - 7:00 PM
-|Nikhil D'Souza
-|Nikhil D'Souza
-|Jacob Bagadiong
-|
-|
-
-|7:00 PM - 7:30 PM
-|Nikhil D'Souza
-|Nikhil D'Souza
-|
-|Rishabh Rajesh
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|7:30 PM - 8:00 PM
-|
-|
-|
-|Rishabh Rajesh
-|
-
-|8:00 PM - 8:30 PM
-|
-|
-|
-|Rishabh Rajesh
-|
-
-|**Time (ET)**
-|**Monday**
-|**Tuesday**
-|**Wednesday**
-|**Thursday**
-|**Friday**
-
-|8:30 PM - 9:00 PM
-|
-|
-|
-|Rishabh Rajesh
-|
-|===
-
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project01.adoc
deleted file mode 100644
index a043f1434..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project01.adoc
+++ /dev/null
@@ -1,314 +0,0 @@
-= TDM 10100: Project 1 -- 2022
-
-**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code.
-
-**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data!
-
-**Scope:** r, Jupyter Lab, Anvil
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Anvil.
-- Read and write basic (csv) data using R.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/1991.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-- `/anvil/projects/tdm/data/disney/flight_of_passage.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster].
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters".
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer.
-
-.Items to submit
-====
-- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil.
-- A sentence explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-=== Question 2
-
-We will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward.
-
-[TIP]
-====
-If you did not (yet) set up your 2-factor authentication credentials with Duo, you can go back to Step 9 and set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup
-====
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 MB of memory.
-
-[NOTE]
-====
-If you select 4000 MB of memory instead of 3800 MB, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 MB per CPU core.
-====
-
-We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine.
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2022-s2023::
-The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-f2022-s2023-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you.
-
-[NOTE]
-====
-Soon, we'll have the f2022-s2023-r kernel available and ready to use!
-====
-
-Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on?
-
-[source,r]
-----
-%%R
-
-system("hostname", intern=TRUE)
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node on Anvil that you are running on).
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know!
-
-Practice running the following examples.
-
-python::
-[source,python]
-----
-my_list = [1, 2, 3]
-print(f'My list is: {my_list}')
-----
-
-SQL::
-[source, sql]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-[source, ipython]
-----
-%%sql
-
-SELECT * FROM titles LIMIT 5;
-----
-
-[NOTE]
-====
-In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements!
-
-[source,ipython]
-----
-%load_ext sql
-----
-====
-
-bash::
-[source,bash]
-----
-%%bash
-
-awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv
-----
-
-[TIP]
-====
-To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`).
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many cells of each type are there in the default template?
-====
-
-=== Question 5
-
-++++
-
-++++
-
-In question (1) we answered questions about cores and memory for the Anvil clusters. To do so, we needed to perform some arithmetic. Instead of using a calculator (or paper, or mental math for you good-at-mental-math folks), write these calculations using R _and_ Python, in separate code cells.
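-
-For example, if a (hypothetical) sub-cluster had 1000 nodes, each with 128 cores and 256 GB of memory, the arithmetic in R might look like the sketch below -- substitute the real numbers from the Anvil documentation. The same calculation in Python is nearly identical.
-
-[source,r]
-----
-%%R
-
-# hypothetical values -- replace with the real numbers from the Anvil webpage
-nodes <- 1000
-cores_per_node <- 128
-memory_per_node_gb <- 256
-
-nodes * cores_per_node       # total cores
-nodes * memory_per_node_gb   # total memory, in GB
-----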
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-++++
-
-++++
-
-++++
-
-++++
-
-In the previous question, we ran our first R and Python code (aside from _provided_ code). In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code.
-
-[source,ipython]
-----
-%%R
-
-dat <- read.csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv")
-----
-
-Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. The `head` function will return the first 6 rows of the dataset by default.
-
-[source,r]
-----
-%%R
-
-head(dat)
-----
-
-[IMPORTANT]
-====
-Remember -- if you are in a _new_ code cell, you'll need to add `%%R` to the top of the code cell, otherwise, Jupyter will try to run your R code using the _Python_ interpreter -- that would be no good!
-====
-
-`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`.
-
-Run our code to read in our dataset again; this time, instead of naming the resulting dataset `dat`, name it `flight_of_passage`. Place all of your code into a new cell. Be sure to include a level 2 header titled "Question 6", above your code cell.
-
-[TIP]
-====
-In markdown, a level 2 header is any line starting with 2 hashtags. For example, `Question X` with two hashtags beforehand is a level 2 header. When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here].
-====
-
-[NOTE]
-====
-We didn't need to re-read our data in this question just to have our dataset named `flight_of_passage`. We could have renamed `dat` to `flight_of_passage` like this.
-
-[source,r]
-----
-flight_of_passage <- dat
-----
-
-Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right, this is certainly _not_ what we want! What if it was a 5 GB dataset? That would be a lot of wasted space! Well, R uses copy-on-modify semantics. What this means is that until you modify either `dat` or `flight_of_passage`, the dataset isn't actually copied. You can therefore run the following code to remove the other reference to our dataset.
-
-[source,r]
-----
-rm(dat)
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-++++
-
-++++
-
-Let's pretend we are now done with the project. We've written some code, maybe added some markdown cells to explain what we did, and we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project.
-
-We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading.
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
-
-You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.
-====
-
-A `.ipynb` file is generated by first running every cell in the notebook, and then clicking the "Download" button from menu:File[Download].
-
-In addition to the `.ipynb`, if a project uses R code, you will need to also submit R code in an R script. An R script is just a text file with the extension `.R`. When submitting Python code, you will need to also submit a Python script. A Python script is just a text file with the extension `.py`.
-
-Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Next, take the Python code from this project and copy and paste it into a text file with the `.py` extension. Call it `firstname-lastname-project01.py`. Download your `.ipynb` file -- making sure that the output from all of your code is present and in the notebook (the `.ipynb` file will also be referred to as "your notebook" or "Jupyter notebook").
-
-Once complete, submit your notebook, R script, and Python script.
-
-.Items to submit
-====
-- `firstname-lastname-project01.R`.
-- `firstname-lastname-project01.py`.
-- `firstname-lastname-project01.ipynb`.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project02.adoc
deleted file mode 100644
index 356f71284..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project02.adoc
+++ /dev/null
@@ -1,265 +0,0 @@
-= TDM 10100: Project 2 -- 2022
-Introduction to R part I
-
-The Rfootnote:[R is case sensitive] environment is a powerful tool for performing data analysis. The R language is often compared to Python. Both languages have their advantages and disadvantages, and both are worth learning.
-
-In this project we will dive in head first and learn some of the basics while solving data-driven problems.
-
-
-.5 basic types of data
-[%collapsible]
-====
- * Values like 1.5 are called numeric values, real numbers, decimal numbers, etc.
- * Values like 7 are called integers or whole numbers.
- * Values TRUE or FALSE are called logical values or Boolean values.
- * Texts consist of sequences of words (also called strings), and words consist of sequences of characters.
- * Values such as 3 + 2ifootnote:[https://stat.ethz.ch/R-manual/R-devel/library/base/html/complex.html] are called complex numbers. We usually do not encounter these in The Data Mine.
-====
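-
-A quick way to see how R labels these types is to ask for the `class` of a few sample values; the following is just a small sketch you can paste into a cell and run.
-
-[source,r]
-----
-class(1.5)      # "numeric"
-class(7L)       # "integer" (the L suffix marks an integer literal)
-class(TRUE)     # "logical"
-class("hello")  # "character"
-class(3 + 2i)   # "complex"
-----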
-
-
-
-[NOTE]
-====
-R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes.
-
-In addition, there are many other languages and tools, e.g., https://julialang.org/[Julia] and https://www.rust-lang.org/[Rust] and https://go.dev/[Go] and many other languages are emerging as relatively newer languages that each have their own advantages.
-====
-
-**Context:** In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples.
-
-In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis.
-
-**Scope:** xref:programming-languages:R:index.adoc[r], xref:programming-languages:R:lists-and-vectors.adoc[vectors, lists], indexing
-
-.Learning Objectives
-****
-- Be aware of the different concepts and when to apply them; such as lists, vectors, factors, and data.frames
-
-- Be able to explain and demonstrate: positional, named, and logical indexing.
-- Read and write basic (csv) data using R.
-- Identify good and bad aspects of simple plots.
-
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv", stringsAsFactors = TRUE)
-----
-
-== ONE
-
-++++
-
-++++
-
-The data that we may be working with does not always come to us neat and cleanfootnote:["Raw data" vs "clean data": some datasets require "cleaning", such as removing duplicates, removing null values, and discarding irrelevant data]. It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems.
-
-.Insider Knowledge
-[%collapsible]
-====
-Datasets can be thought of as one or more observations of one or more variables. For most datasets, each row is an observation and each column is a variable. (There may be some datasets that do not follow that convention.)
-====
-
-We are going to use the `read.csv` function to load our dataset into a dataframe named `myDF` (as shown above). +
-We want to use functions such as `head`, `tail`, `dim`, `summary`, `str`, and `class` to get a better understanding of our dataframe (DF).
-
-.Helpful Hints
-[%collapsible]
-====
-[source,r]
-----
-#looks at the head of the dataframe
-head(myDF)
-#looks at the tail of the dataframe
-tail(myDF)
-#returns the type of data in a column of the dataframe, for instance, the type of data in the column that stores the destination airports of the flights
-class(myDF$Dest)
-----
-====
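-
-In addition, `dim` and `str` summarize the size and structure of the dataframe; a small sketch:
-
-[source,r]
-----
-dim(myDF)   # the number of rows and columns
-str(myDF)   # the type of data stored in each column
-----
-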
-[loweralpha]
-.. How many columns does this dataframe have?
-.. How many rows does this dataframe have?
-.. What type(s) of data are in this dataframe (for example: numerical values, and/or text strings, etc.)?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the three questions above
-====
-
-== TWO
-We can create a new vectorfootnote:[https://sudo-labs.github.io/r-data-science/vectors/] containing all of the origin airports (i.e., the airports where the flights departed) from the column `myDF$Origin` of the data frame `myDF`.
-[source,r]
-----
-#takes the selected information from the dataframe and puts it into a new vector called `myairports`
-myairports <- myDF$Origin
-----
-
-.Insider Knowledge
-[%collapsible]
-====
-A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc.
-====
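-
-As a sketch of how you might work with such a vector -- using a made-up airport code `XYZ` as a placeholder (you will need to look up the real code for O'Hare) -- consider:
-
-[source,r]
-----
-head(myairports, n=250)                  # print the first 250 origin airports
-sum(head(myairports, n=250) == "XYZ")    # how many of those first 250 equal "XYZ"
-sum(myairports == "XYZ")                 # how many flights departed from "XYZ" in all of 1995
-----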
-To assist with this question, please also see the end of the video from Question 1 (above).
-[loweralpha]
-.. What type of data is in the vector `myairports`?
-.. The vector `myairports` contains all of the airports where flights departed in 1995. Print the first 250 of those airports. [Do not print all of the airports, because there are 5327435 such values!] How many of the first 250 flights departed from O'Hare?
-.. How many flights departed from O'Hare altogether in 1995?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the 3 questions above.
-====
-
-== THREE
-
-++++
-
-++++
-
-Indexing
-
-.Insider Knowledge
-[%collapsible]
-====
-Accessing data can be done in many ways; one of those ways is called **_indexing_**. Typically we use brackets **[ ]** when indexing. By doing this, we can select or even exclude specific elements. For example, we can select a specific column and a certain range within the column. Some examples of symbols to help us select elements include: +
- * < less than +
- * > greater than +
- * <= less than or equal to +
- * >= greater than or equal to +
- * == is equal +
- * != is not equal +
-It is also important to note that indexing in R begins at 1. (This means that the first row of the dataframe will be numbered starting at 1.)
-====
-.Helpful Hints
-[%collapsible]
-====
-[source,r]
-----
-# finding data by their indices (a column is a vector, so it only needs one index, with no comma)
-myDF$Distance[row_index_start:row_index_end]
-# creates a new vector with the data from a specific column
-mynewvector <- myDF$putcolumnnamehere
-# all of the data from row 3
-myDF[3,]
-# all of the rows, with the columns between myfirstcolumn and mylastcolumn
-myDF[,myfirstcolumn:mylastcolumn]
-# the first 250 values from column 17
-head(myDF[,17], n=250)
-# selects all of the distances that are greater than 2000 miles
-longdistances <- myDF$Distance[myDF$Distance > 2000]
-----
-====
-[loweralpha]
-.. How many flights departed from Indianapolis (`IND`) in 1995? How many flights landed there?
-.. Consider the flight data from row 894 of the data frame. What airport did it depart from? Where did it arrive?
-.. How many flights have a distance of less than 200 miles?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the 3 questions above.
-====
-
-== FOUR
-
-++++
-
-++++
-
-Summarizing vectors using tables +
-
-The `table` command is helpful to know, for summarizing large quantities of data.
-
-
-.Insider Knowledge
-[%collapsible]
-====
-It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values.
-Remember also that R is a case-sensitive language.
-[source,r]
-----
-table(myDF$Origin) # summarizes how many flights departed from each airport
-sort(table(myDF$Origin)) # sorts those results in numeric order
-tail(sort(table(myDF$Origin)),n=10) # finds the 10 most popular airports, according to the number of flights that departed from each airport.
-----
-
-====
-[loweralpha]
-.. Rank the airline companies (in the column `myDF$UniqueCarrier`) according to their popularity, i.e., according to the number of flights on each airline.
-.. Which are the three most popular airlines from 1995?
-.. Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the 3 questions above.
-====
-
-== FIVE
-
-++++
-
-++++
-
-Basic graph types are helpful for visualizing data. They can be an important tool in discovering insights into the data you are working with. +
-R has a number of tools built in for basic graphs, such as scatter plots, bar charts, histograms, etc.
-
-.Insider Knowledge
-[%collapsible]
-====
-A dot plot, also known as a dot chart, is similar to a bar chart or a scatter plot. In R, the categories are displayed along the vertical axis and the corresponding values are displayed along the horizontal axis. +
-
-We can assign each group a color to help differentiate them while plotting a dot chart. +
-
-We can also plot a column that we find interesting, to take a look at what the data might show us.
-For example, if we wanted to see whether the number of flights differs across the days of the week, we could use `hist`.
-[source,r]
-----
-mydays<- myDF$DayOfWeek
-hist(mydays)
-----
-
-====
-
-.Helpful Hints
-[%collapsible]
-====
-[source,r]
-----
-mycities <- tail(sort(table(myDF$Origin)),n=10)
-dotchart(mycities, pch = 21, bg = "green", pt.cex = 1.5)
-----
-====
-[loweralpha]
-.. Pick a column of data that you are interested in studying, or a question that you want answered. Create either a `plot`, or a `dotchart`. Before making the plot, think about how many dots will be displayed on your `plot` or `dotchart`. If you try to display millions of dots, you might cause your Jupyter Lab session to freeze or crash. It is useful to think ahead and to consider how your plot might look, before you accidentally try to display millions of dots.
-.. Describe any patterns you may see in your plot or your dotchart. If there are none, that is okay, and you can just write "there seem to be no patterns."
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The plot or dotchart and your commentary about what you created and what you observed.
-====
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project03.adoc
deleted file mode 100644
index 1601bc66a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project03.adoc
+++ /dev/null
@@ -1,241 +0,0 @@
-= TDM 10100: Project 3 -- Fall 2022
-Introduction to R part II
-
-**Motivation:** `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.
-
-**Context:** In the previous project, we ran our first R code and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called `data.frame`s.
-
-**Scope:** r, data.frames, recycling, factors
-
-.Learning Objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-++++
-
-++++
-
-[TIP]
-====
-As described in the video above, Dr Ward is using:
-
-`options(jupyter.rich_display = F)`
-
-so that the work using the kernel `f2022-s2023-r` looks similar to the work using the kernel `f2022-s2023`. We will probably make this option permanent in the future, but I just wanted to point this out. You do not have to do this, but I like the way the output looks with this option.
-====
-
-Using the *f2022-s2023-r* kernel,
-let's first see all of the files that are in the Disney folder:
-[source,r]
-----
-list.files("/anvil/projects/tdm/data/disney")
-----
-
-After looking at several of the files, we will go ahead and read in the data frame on the 7 Dwarfs Train.
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/disney/7_dwarfs_train.csv", stringsAsFactors = TRUE)
-----
-
-If we want to see the file size (i.e., how large the file is, in bytes) of the CSV:
-[source,r]
-----
-file.info("/anvil/projects/tdm/data/disney/7_dwarfs_train.csv")$size
-----
-You can also use `file.info` to see other information about the file.
-
-.Insider Knowledge
-[%collapsible]
-====
-*size* - double: file size in bytes. +
-*isdir* - logical: is the file a directory? +
-*mode* - integer of class "octmode": the file permissions, printed in octal, for example 644. +
-*mtime, ctime, atime* - integer of class "POSIXct": file modification, ‘last status change’, and last access times. +
-*uid* - integer: the user ID of the file's owner. +
-*gid* - integer: the group ID of the file's group. +
-*uname* - character: uid interpreted as a user name. +
-*grname* - character: gid interpreted as a group name. Unknown user and group names will be NA.
-====
-
-=== ONE
-
-++++
-
-++++
-
-Familiarizing yourself with the data.
-
-.Helpful Hint
-[%collapsible]
-====
-You can look at the first 6 rows (`head`) and the last 6 rows (`tail`), as well as the structure (`str`) and/or the dimensions (`dim`) of the dataset. +
-
-*"SACTMIN"* is the actual minutes that a person waited in line +
-*"SPOSTMIN"* is the time about the ride, estimating the wait time. (Any value that is -999 means that the ride was not in service) +
-*"datetime"* is the date and time the information was recorded +
-*"date"* is the date of the event
-====
-
-In the last project, we learned how to look at a data.frame. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the types of data, etc.) and what it holds. Use the `head` command to look at the first 21 rows (a sketch is shown below).
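-
-For instance, `head` accepts an `n` argument, so one possible sketch for looking at the first 21 rows (and checking the overall size) is:
-
-[source,r]
-----
-head(myDF, n=21)   # the first 21 rows
-dim(myDF)          # the number of rows and columns
-----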
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining the dataset.
-====
-
-=== TWO
-
-++++
-
-++++
-
-Now that we have a better understanding of the content and structure of our data, we are diving a bit deeper and making connections within the data.
-
-[loweralpha]
-.. If we are looking at the column *"SPOSTMIN"*, what do you notice about the increments of time? I.e., is there anything special about the types of values that appear? How many different wait time options do you see in *"SPOSTMIN"*?
-.. How many `NA` values do you see in *"SPOSTMIN"*?
-.. Create a new data frame with the name `newDF` in which the *"SPOSTMIN"* column has all `NA` values removed. In other words, select the rows of `myDF` for which *"SPOSTMIN"* is not `NA` and call the resulting `data.frame` by the name `newDF`.
-
-.Insider Knowledge
-[%collapsible]
-====
-`na.omit` and `na.exclude` return objects with the observations removed if they contain any missing values, so subsequent calculations simply do not include those observations. +
-`na.rm` first [.underline]#removes the NA values and then# does the calculation. +
-`na.pass` returns the object unchanged. +
-It is also possible to use the `subset` function and the `is.na` function (a sketch appears after the hint below).
-====
-
-.Helpful Hint
-[%collapsible]
-====
-Use the code below
-[source,r]
-----
-table(myDF$SPOSTMIN)
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answer to the 3 questions above.
-====
-=== THREE
-
-++++
-
-++++
-
-++++
-
-++++
-
-Use the `myDF` data.frame for this question.
-[loweralpha]
-.. On Christmas day, what was the average wait time? On July 26th, what was the average wait time?
-.. Is there a difference between the wait times in the summer and the holidays?
-.. On which date do the most entries occur in the data set?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answer to the 3 questions above.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Recycling in R +
-
-.Insider Knowledge
-[%collapsible]
-====
-Recycling happens in R automatically when you attempt to perform operations like addition or subtraction on two vectors of unequal length. +
-The shorter vector is repeated (recycled) until the operation has been applied along the full length of the longer vector.
-====
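-
-As a quick illustration of recycling (toy vectors, unrelated to the Disney data):
-[source,r]
-----
-longvec <- c(10, 20, 30, 40, 50, 60)
-shortvec <- c(1, 2)
-
-# shortvec is recycled as c(1, 2, 1, 2, 1, 2)
-longvec + shortvec
-----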
-
-[loweralpha]
-.. Find the lengths of the column *"SPOSTMIN"* in the `myDF` and `newDF`.
-.. Create a new vector called `myhours` by adding together *"SPOSTMIN"* columns from `myDF` and `newDF` with each divided by 60. What is the length of that new vector `myhours`?
-.. What happened in row 313997? Why?
-
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the 3 questions above.
-====
-
-
-=== FIVE
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Indexing and Expanding dataframes in R
-
-[source,r]
-----
-library(lubridate)
-myDF$weekday <- wday(myDF$datetime, label=TRUE)
-----
-
-[loweralpha]
-.. Consider the average wait times. What day of the week in `myDF` has the longest average wait times?
-.. Make a plot and a dotchart that illustrate the data for the average wait times. Which one conveys the information better and why?
-.. We created a new column in `myDF` that shows the weekdays. Do the same thing as in parts (a) and (b) again, but this time using the months instead of the days of the week.
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the 3 questions above.
-====
-
-
-
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project04.adoc
deleted file mode 100644
index fbdd77d1b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project04.adoc
+++ /dev/null
@@ -1,262 +0,0 @@
-= TDM 10100: Project 4 -- Fall 2022
-Introduction to R part III
-
-++++
-
-++++
-
-Many data science tools including xref:programming-languges:R:introduction[R] have powerful ways to index data.
-
-.Insider Knowledge
-[%collapsible]
-====
-R operations are typically vectorized, so there is little to no need to write loops. +
-R also typically uses indexing instead of if statements.
-
-* Sequential statements run one after another, i.e., +
-1. print line 45 +
-2. print line 15 +
-
-**if/else statements**
-control which statements run, based on a logical condition. +
-
-if statement example:
-[source,r]
-----
-x <- 7
-if (x > 0) {
-  print("Positive number")
-}
-----
-else statement example:
-[source,r]
-----
-x <- -10
-if (x >= 0) {
-  print("Non-negative number")
-} else {
-  print("Negative number")
-}
-----
-In `R`, we can classify many numbers all at once:
-[source,r]
-----
-x <- c(-10,3,1,-6,19,-3,12,-1)
-mysigns <- rep("Non-negative number", times=8)
-mysigns[x < 0] <- "Negative number"
-mysigns
-----
-
-====
-**Context:** As we continue to become more familiar with `R`, this project will help reinforce the many ways of indexing data in `R`.
-
-**Scope:** r, data.frames, indexing.
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-
-Using the *f2022-s2023-r* kernel, let's first see all of the files that are in the `craigslist` folder:
-[source,r]
-----
-list.files("/anvil/projects/tdm/data/craigslist")
-----
-
-After looking at several of the files, we will go ahead and read in the vehicles data frame:
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)
-----
-
-.Helpful Hints
-[%collapsible]
-====
-Remember: +
-
-* If we want to see the file size of the CSV (i.e., how large it is):
-[source,r]
-----
-file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
-----
-
-* You can also use `file.info` to see other information about the file.
-====
-
-=== ONE
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Each time we look at data, it is important to start by becoming familiar with the data. +
-In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.
-
-This dataset has 25 columns, and we are unable to see them all without adjusting the display width. We can do this with:
-[source,r]
-----
-options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)
-----
-and we also remember (from the previous project) that we can set the output in `R` to look more natural this way:
-[source,r]
-----
-options(jupyter.rich_display = F)
-----
-
-
-.Helpful Hint
-[%collapsible]
-====
-You can look at the first 6 rows (`head`), the last 6 rows (`tail`), the structure (`str`), and/or the dimensions (`dim`) of the dataset.
-====
-
-[loweralpha]
-.. How many unique regions are there in total? Name 5 of the different regions that are included in this dataset.
-.. How many cars are manufactured in 2011 or afterwards, i.e., they are made in 2011 or newer?
-.. In what year was the oldest model manufactured? In what year was the most recent model manufactured? In which year were the most cars manufactured?
-
-.Helpful Hint
-[%collapsible]
-====
-To sort and order a single vector you can use this code:
-[source,r]
-----
-head(myDF$year[order(myDF$year)])
-----
-You can also use the `sort` function, as demonstrated in earlier projects.
-====
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answers to the 3 questions above.
-====
-
-=== TWO
-
-++++
-
-++++
-
-++++
-
-++++
-
-[loweralpha]
-.. Create a new column in your data.frame that is labeled `newflag` which indicates if the vehicle for sale has been labeled as `like new`. In other words, the column `newflag` should be `TRUE` if the vehicle on that row is `like new`, and `FALSE` otherwise.
-.. Create a new column called `pricecategory` that is
-... `cheap` for vehicles less than or equal to $1,500
-... `average` for vehicles strictly more than $1,500 but less than or equal to $10,000
-... `expensive` for vehicles strictly more than $10,000
-.. How many cars are there in each of these three `pricecategory` groups?
-
-
-.Helpful Hint
-[%collapsible]
-====
-Remember to consider any 0 values and/or `NA` values.
-
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answer to the questions above.
-====
-
-=== THREE
-
-++++
-
-++++
-
-++++
-
-++++
-
-_**Vectorization**_
-
-Most of R's functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the `[]` symbol for indexing.
-
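-For example, here is a small sketch of vectorized arithmetic and `[]` indexing (a toy vector, unrelated to the craigslist data):
-[source,r]
-----
-prices <- c(1200, 8000, 15000, 500)
-
-# vectorized: operates on every element at once, no loop needed
-prices / 1000
-
-# indexing with []: extract the elements that satisfy a condition
-prices[prices > 1500]
-----
-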
-[loweralpha]
-.. Using the `table()` function, and the column `myDF$newflag`, identify how many vehicles are `like new` and how many vehicles are not `like new`.
-.. Now using the `cut` function and appropriate `breaks`, create a new column called `newpricecategory`. Verify that this column is identical to the previously created `pricecategory` column, created in question TWO.
-.. Make another column called `odometerage`, which has values `new` or `middle age` or `old`, according to whether the odometer is (respectively): less than or equal to 50000; strictly greater than 50000 and less than or equal to 100000; or strictly greater than 100000. How many cars are in each of these categories?
-
-.Helpful Hint
-[%collapsible]
-====
-[source,r]
-----
-cut(myvector, breaks = c(0, 10, 50, 200), labels = c("a", "b", "c"))
-----
-====
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answer to the questions above.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-**Preparing for Mapping**
-
-[loweralpha]
-.. Extract all of the data for `indianapolis` into a `data.frame` called `myIndy`
-.. Identify the most popular region from `myDF`, and extract all of the data from that region into a `data.frame` called `popularRegion`.
-.. Create a third `data.frame` with the data from a region of your choice.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answer to the questions above.
-====
-
-
-=== FIVE
-
-++++
-
-++++
-
-**Mapping**
-
-Using the R package `leaflet`, make 3 maps of the USA, namely, one map for the data in each of the `data.frames` from question FOUR.
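-
-If you have not used `leaflet` before, here is a minimal sketch for one of the data.frames. (This assumes the latitude and longitude columns in the craigslist data are named `lat` and `long`; check `names(myIndy)` to confirm before using it.)
-[source,r]
-----
-library(leaflet)
-
-# drop rows with missing coordinates before plotting
-plotDF <- myIndy[!is.na(myIndy$lat) & !is.na(myIndy$long), ]
-
-leaflet(plotDF) %>%
-  addTiles() %>%
-  addCircleMarkers(lng = ~long, lat = ~lat, radius = 2)
-----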
-
-
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The answers to the 3 questions above.
-====
-
-
-
-
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project05.adoc
deleted file mode 100644
index 811d5e02b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project05.adoc
+++ /dev/null
@@ -1,200 +0,0 @@
-= TDM 10100: Project 5 -- Fall 2022
-Tapply and DataFrames
-
-++++
-
-++++
-
-**Motivation:** `R` differs from many other programming languages in that it _typically_ works best with vectorized functions and the _apply_ suite, rather than loops.
-
-.Insider Knowledge
-[%collapsible]
-====
-Apply functions are an alternative to loops. You can use *`apply()`* and its variants (i.e., `mapply()`, `sapply()`, `lapply()`, `vapply()`, `rapply()`, `tapply()`, ...) to manipulate pieces of data from data.frames, lists, arrays, and matrices in a repetitive way. The *`apply()`* functions allow for flexibility in crossing data in multiple ways that a loop does not.
-====
-
-**Context:** We will focus in this project on efficient ways of processing data in `R`.
-
-**Scope:** r, data.frames, recycling, factors, if/else, for loops, apply suite
-
-.Learning Objectives
-****
-- Demonstrate the ability to use the `tapply` function.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s) in anvil:
-
-/anvil/projects/tdm/data/election/escaped2020sample.txt
-
-.Helpful Hint
-[%collapsible]
-====
-A txt file and a csv file both store information in plain text. *csv* files are _always_ separated by commas. In *txt* files the fields can be separated by commas, semicolons, or tabs.
-
-
-To read in this pipe-delimited txt file with `read.csv`, we simply add `sep="|"` (see code below):
-[source,r]
-----
- myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")
-----
-====
-
-== Questions
-
-=== ONE
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Read the dataset `escaped2020sample.txt` into a data.frame called `myDF`. The dataset contains contribution information for the 2020 election year.
-
-The dataset has a column named `TRANSACTION_DT` which is set up in the `[month].[day].[year]` format.
-We want to organize the dates in chronological order.
-
-When working with dates, it is important to use tools specifically for this purpose (rather than using string manipulation, for example). We've provided you with the code below. The provided code uses the `lubridate` package, an excellent package which hides away many common issues that occur when working with dates. Feel free to check out https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf[the official cheatsheet] in case you'd like to learn more about the package.
-
-[source,r]
-----
-library(lubridate, warn.conflicts = FALSE)
-----
-
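-For example, `mdy` converts month/day/year strings into proper `Date` objects (toy values shown, not from the dataset):
-[source,r]
-----
-mdy(c("12/25/2020", "07/04/2019"))
-----
-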
-[loweralpha]
-.. Use the `mdy` function (from the `lubridate` library) on the column `TRANSACTION_DT`, to create a new column named `newdates`.
-.. Using `tapply`, sum the values in the `TRANSACTION_AMT` column, grouped according to the values in the `newdates` column.
-.. Plot the dates on the x-axis and the information we found in part b on the y-axis.
-
-.Helpful Hint
-[%collapsible]
-====
-*tapply()* helps us to compute statistical measures such as mean, median, minimum, maximum, sum, etc... for data that is split into groups. *tapply()* is most helpful when we need to break up a vector into groups, and compute a function on each of the groups.
-====
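-
-A minimal sketch of `tapply` on toy vectors (names and values made up for illustration):
-[source,r]
-----
-amounts <- c(10, 25, 5, 40, 15)
-groups  <- c("a", "b", "a", "b", "a")
-
-# sum the amounts within each group
-tapply(amounts, groups, sum)
-----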
-
-[WARNING]
-====
-If your `tapply` in Question 1b hates you (e.g., it will absolutely not finish the `tapply`, even after a few minutes), then the fix described below will likely help. Please note that, after you run this fix, you need to reset your memory back to 5000 MB at time 4:16 in the video.
-
-You do not need to run this "fix" unless you have a cell like this, which should be running, but you are "stuck" on it:
-====
-
-++++
-
-++++
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== TWO
-
-++++
-
-++++
-
-The plot that we just created in question one shows us that the majority of the data collected is found in the years 2018-2020. So we will focus on the year 2019.
-
-[loweralpha]
-.. Create a new dataframe that only contains data for the dates in the range 01/01/2019-05/15/2019
-.. Plot the new dataframe
-.. What do you notice about the data?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answer to the questions above
-====
-
-=== THREE
-
-++++
-
-++++
-
-++++
-
-++++
-
-Let's look at the donations by city and state.
-
-[loweralpha]
-.. Find the sum of the total donations contributed in each state.
-.. Create a new column that pastes together the city and state.
-.. Find the total donation amount for each city/state location. In the output, do you notice anything suspicious in the result? How do you think that occurred?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answers to the questions above.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-Let's take a look at who is donating.
-
-[loweralpha]
-.. Find the type of data that is in the `NAME` column.
-.. Split up the names in the `NAME` column, to extract the first names of the donors. (This will not be perfect, but it is our first attempt.)
-.. How much money is donated (altogether) by people named `Mary`?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answer to the questions above
-====
-
-=== FIVE
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-Employment status
-
-[loweralpha]
-.. Using a `barplot` or `dotchart`, show the total amount of donations made by `EMPLOYED` vs `NOT EMPLOYED` individuals
-.. What is the category of occupation that donates the most money?
-.. Plot something that you find interesting about the employment and/or occupation columns
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining what you chose to plot and why.
-- Answers to the questions above.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project06.adoc
deleted file mode 100644
index cde2eac5f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project06.adoc
+++ /dev/null
@@ -1,118 +0,0 @@
-= TDM 10100: Project 6 -- Fall 2022
-Tapply, Tapply, Tapply
-
-**Motivation:** We want to have fun and get used to the function `tapply`
-
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/olympics/athlete_events.csv`
-- `/anvil/projects/tdm/data/death_records/DeathRecords.csv`
-
-== Questions
-
-=== ONE
-
-++++
-
-++++
-
-Read the dataset `/anvil/projects/tdm/data/olympics/athlete_events.csv`, into a data.frame called `eventsDF`. (We do not need the `tapply` function for Question 1.)
-
-[loweralpha]
-.. What are the years included in this data.frame?
-.. What are the different countries participating in the Olympics?
-.. How many times is each country represented?
-
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answers to the questions above.
-====
-
-=== TWO
-
-++++
-
-++++
-
-[loweralpha]
-.. What is the average height of participants from each country?
-.. What are the oldest ages of the athletes from each country?
-.. What is the sum of the weights of all participants from each country?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answers to the questions above.
-====
-
-=== THREE
-
-++++
-
-++++
-
-Read the dataset `/anvil/projects/tdm/data/death_records/DeathRecords.csv` into a data.frame called `deathrecordsDF`. (We do not need the `tapply` function for Question 3.)
-
-[loweralpha]
-.. What are the column names in this dataframe?
-.. Change the column "DayOfWeekOfDeath" from numbers to weekday names.
-.. How many people died in total on each day of the week?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answers to the questions above
-====
-
-=== FOUR
-
-++++
-
-++++
-
-[loweralpha]
-.. What is the average age of Females versus Males at death?
-.. What is the number of Females who are married? Divorced? Widowed? Single? Now find the analogous numbers for Males.
-.. Now solve both questions from 4b at one time, i.e., use one command to find the number of Females who are married, divorced, widowed, or single, and the number of Males in each of these four categories. You can compute all eight numbers with just one `tapply` command.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Answers to the question above
-====
-
-=== FIVE
-
-++++
-
-++++
-
-[loweralpha]
-.. Using the two datasets, create two separate graphs or plots of data that you find interesting (one graph or plot for each of the two datasets in this project). Write 1-2 sentences on each one explaining why you found it interesting and/or what you noticed in the dataset.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project07.adoc
deleted file mode 100644
index 619e7e645..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project07.adoc
+++ /dev/null
@@ -1,181 +0,0 @@
-= TDM 10100: Project 7 -- 2022
-
-**Motivation:** A couple of bread-and-butter functions that are part of base R are `subset` and `merge`. `subset` provides a more natural way to filter and select data from a data.frame. `merge` brings the principles that SQL uses for combining data to R.
-
-**Context:** We've been getting comfortable working with data within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!
-
-**Scope:** r, subset, merge, tapply
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/titles.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/episodes.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/people.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/ratings.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-[IMPORTANT]
-====
-Please select 6000 memory when launching Jupyter for this project.
-====
-
-Data can come in a lot of different formats and from a lot of different locations. It is not uncommon to have one or more files that need to be combined together before analysis is performed. `merge` is a popular function in most data wrangling libraries. It is extremely similar and essentially equivalent to a `JOIN` in SQL.
-
-Read in each of the datasets into data.frames called: `titles`, `episodes`, `people`, and `ratings`.
-
-[NOTE]
-====
-Read the data in using the following code. `fread` is a _very_ fast and efficient way to read in data.
-
-[source,r]
-----
-library(data.table)
-
-titles <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv"))
-episodes <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv"))
-people <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/people.csv"))
-ratings <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv"))
-----
-====
-
-- What are all the different listed genres (in the `titles` table)?
-- Look at the `years` column and the `genres` column. In which year did the most comedies debut?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Use the `episode_title_id` column and the `title_id` column from the `episodes` and `titles` data.frame's (respectively) to merge the two data.frames.
-
-Ultimately, we want to end up with a new data.frame that contains the `primary_title` for every episode in the `episodes` table. Use the `merge` function to accomplish this.
-
-[TIP]
-====
-The `merge` function in `R` allows two data frames to be combined by common columns. This function allows the user to combine data similar to the way `SQL` would using `JOIN`s. https://www.codeproject.com/articles/33052/visual-representation-of-sql-joins[Visual representation of SQL Joins]
-====
-
-[TIP]
-====
-This is also a really great https://www.datasciencemadesimple.com/join-in-r-merge-in-r/[explanation of merge in `R`].
-====
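-
-If it helps, here is a minimal sketch of how `merge` matches rows on differently-named key columns (toy data frames, not the IMDB data):
-[source,r]
-----
-left  <- data.frame(episode_title_id = c("e1", "e2"), season = c(1, 1))
-right <- data.frame(title_id = c("e1", "e2"), primary_title = c("Pilot", "Second"))
-
-# by.x / by.y name the key column in each data.frame
-merge(left, right, by.x = "episode_title_id", by.y = "title_id")
-----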
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-Use `merge` (a few times) to create a new data.frame that contains at least the following information for **only** the show called "Friends". "Friends" (the show itself) has a `title_id` of tt0108778. Each episode of Friends, has its own `title_id` which contains the information for the specific episode as well.
-
-- The `primary_title` of the **episode** -- call it `episode_title`.
-- The `primary_title` of the **show itself** -- call it `show_title`.
-- The `rating` of the show itself -- call it `show_rating`.
-- The `rating` of the episode -- call it `episode_rating`.
-
-[TIP]
-====
-Start by getting a subset of the `episodes` table that contains only information for the show Friends. That way, we aren't working with as much data.
-====
-
-Show the top 5 rows of your final data.frame that contain the top 5 rated episodes.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Use regular old indexing to find all episodes of Friends with an `episode_rating` greater than 9 and `season_number` of exactly 5.
-
-Repeat the process, but this time use the `subset` function instead.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-`subset` is a sometimes useful function that allows you to index data.frame's in a less verbose manner. Read https://the-examples-book.com/programming-languages/R/subset[this].
-
-While it may appear to be a clean way to subset data, I'd suggest preferring explicit long-form indexing over it. Read http://adv-r.had.co.nz/Computing-on-the-language.html[this fantastic article by Dr. Hadley Wickham on non-standard evaluation]. Take, for example, the following (a bit contrived) example using the dataframe we got in question (3).
-
-[source,r]
-----
-season_number = 6
-results[results$episode_rating > 9 & results$season_number == season_number,]
-subset(results, episode_rating > 9 & season_number == season_number)
-----
-
-Read that provided article and do your best to explain _why_ `subset` gets a different result than our example that uses regular indexing.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project08.adoc
deleted file mode 100644
index 5d98f0086..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project08.adoc
+++ /dev/null
@@ -1,203 +0,0 @@
-= TDM 10100: Project 8 -- 2022
-
-**Motivation:** Functions are an important part of writing efficient code. +
-Functions allow us to repeat and reuse code. If you find yourself using a set of coding steps over and over, a function may be a good way to reduce your lines of code!
-
-**Context:** We've been learning about and using functions these last few weeks. +
-To learn how to write your own functions we need to learn some of the terminology and components.
-
-**Scope:** r, functions
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Comprehend what a function is, and the components of a function in R.
-****
-
-== Dataset(s)
-
-We will use the same dataset(s) as last week:
-
-- `/anvil/projects/tdm/data/movies_and_tv/titles.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/episodes.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/people.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/ratings.csv`
-
-
-[IMPORTANT]
-====
-Please select 6000 memory when launching Jupyter for this project.
-====
-
-.Helpful Hints
-[%collapsible]
-====
-`fread`- is a fast and efficient way to read in data.
-
-[source,r]
-----
-library(data.table)
-
-titles <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv"))
-episodes <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv"))
-people <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/people.csv"))
-ratings <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv"))
-----
-====
-
-== Questions
-
-We write our own functions to make repetitive operations easier by turning them into a single command. +
-
-Take care to name the function something concise but meaningful, so that other users can understand what the function does. +
-
-Function parameters can also be called formal arguments.
-
-.Insider Knowledge
-[%collapsible]
-====
-A function is an object that contains multiple interrelated statements that are run in a predefined order when the function is called. +
-
-Functions can be built-in or created by the user (user-defined). +
-
-.Some examples of built-in functions are:
-
-* min(), max(), mean(), median()
-* print()
-* head()
-
-====
-
-.Helpful Hints
-[%collapsible]
-====
-Syntax of a function
-[source, R]
-----
-what_you_name_the_function <- function(parameters) {
-  # statement(s) that are executed when the function runs
-  # the last line of the function is the returned value
-}
-----
-====
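-
-As a concrete (hypothetical) illustration of this syntax:
-[source,r]
-----
-# returns the average of a numeric vector after dropping NA values
-mean_without_na <- function(x) {
-  mean(x, na.rm = TRUE)
-}
-
-mean_without_na(c(1, 2, NA, 7))
-----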
-
-=== ONE
-
-++++
-
-++++
-
-++++
-
-++++
-
-To gain a better insight into our data, let's make two simple plots:
-
-[loweralpha]
-.. A grouped bar chart https://www.statmethods.net/graphs/bar.html[see an example here]
-.. A line plot http://www.sthda.com/english/wiki/line-plots-r-base-graphs[see an example here]
-.. What information are you gaining from either of these graphs?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== TWO
-
-++++
-
-++++
-
-For practice, now that you have a basic understanding of how to make a function, we will use that knowledge, applied to our dataset.
-
-Here are pieces of a function we will use on this dataset; put them in the correct order +
-
-* results <- merge(ratings_df, titles_df, by.x = "title_id", by.y = "title_id")
-* }
-* function(titles_df, ratings_df, ratings_of_at_least)
-* return(popular_movie_results)
-* {
-* popular_movie_results <- results[results$type == "movie" & results$rating >= ratings_of_at_least, ]
-* find_movie_with_at_least_rating <-
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== THREE
-
-++++
-
-++++
-
-Take the above function and add comments explaining what the function does at each step.
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-[source,r]
-----
-my_selection <- find_movie_with_at_least_rating(titles, ratings, 7.6)
-----
-
-Using the code above answer these questions.
-
-[loweralpha]
-.. How many movies in total are above that rating limit?
-.. Change the limits in the function from "at least 5.0" to "lower than 5.0".
-.. How many movies have ratings lower than 5.0?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FIVE
-
-++++
-
-++++
-
-++++
-
-++++
-
-Now create a function that takes a genre as the input and finds either
-[loweralpha]
-.. the movie from that genre that has the largest number of votes, OR
-.. the movie from that genre that has the highest rating.
-
-(You don't need to do both. In the video, I discuss how to find the movie from that genre that has the highest rating.)
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project09.adoc
deleted file mode 100644
index ddce52640..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project09.adoc
+++ /dev/null
@@ -1,191 +0,0 @@
-= TDM 10100: Project 9 -- 2022
-:page-mathjax: true
-
-Benford's Law
-
-**Motivation:**
-https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] has many applications, including its infamous use in fraud detection. It also helps detect anomalies in naturally occurring datasets.
-
-**Scope:** `R` and functions
-
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-* /anvil/projects/tdm/data/election/escaped2020sample.txt
-
-.Helpful Hint
-[%collapsible]
-====
-A txt file and a csv file both store information in plain text. csv files are always separated by commas. In txt files the fields can be separated by commas, semicolons, or tabs.
-
-To read in this pipe-delimited txt file with `read.csv`, we simply add `sep="|"` (see code below):
-
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")
-----
-====
-
-== Questions
-
-https://www.statisticshowto.com/benfords-law/[Benford's law] (also known as the first digit law) states that the leading digits in a collection of datasets will most likely be small. +
-It is basically a https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/probability-distribution/[probability distribution] that gives the likelihood of each first digit occurring in a set of numbers.
-
-Another way to understand Benford's law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently.
-
-.Insider Knowledge
-[%collapsible]
-====
-A probability distribution defines the probability of each possible event happening. It can describe simple events like a coin toss, or it can be applied to complex events such as the outcomes of drug treatments, etc. +
-
-* Basic probability distributions, which can be shown in a probability distribution table.
-* Binomial distributions, which have “Successes” and “Failures.”
-* Normal distributions, sometimes called a Bell Curve.
-
-Remember that the sum of all the probabilities in a distribution is always 100%, or 1 as a decimal.
-====
-
-.Helpful Hint
-[%collapsible]
-====
-Benford's law is stated in terms of the *significand S(x)* of each number, i.e., the number rewritten in a standard format. +
-
-To do this you must
-
-* Find the first non-zero digit
-* Move the decimal point to the right of that digit
-* Ignore the sign
-
-For example, 9087 and -.9087 both have *S(x)* equal to 9.087.
-
-The law can also be extended to the second, third, and succeeding digits, and to the probability of certain combinations of digits. +
-
-It typically does not apply to datasets that are restricted to a fixed minimum and maximum, or to datasets where the numbers are assigned (e.g., social security numbers, phone numbers) rather than naturally occurring. +
-
-Larger datasets, and data that range over multiple orders of magnitude from low to high, work well with Benford's law.
-====
-
-Benford's law is given by the equation below.
-
-
-$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$
-
-$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)
-
-As an example, the probability of the first digit being a 1 is
-
-$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = 0.301$
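-
-Similarly, as a second worked check of the formula, the probability of the first digit being a 2 is
-
-$P(2) = \dfrac{\ln((2+1)/2)}{\ln(10)} \approx 0.176$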
-
-=== ONE
-
-++++
-
-++++
-
-[loweralpha]
-
-.. Create a function called `benfords_law` that takes the argument `digit`, and calculates the probability of `digit` being the starting figure of a random number based on Benford's law.
-
-.. Create a vector named `digits` with numbers 1-9
-
-.. Now use the `benfords_law` function to create a plot (could be a bar plot, line plot, dot plot, etc., anything is OK) that shows the likelihood of `digits` occurring
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== TWO
-
-++++
-
-++++
-
-[loweralpha]
-. Read in the elections data (we have used this previously) into a dataset named `myDF`.
-
-. Create a vector called `firstdigit` with the first digit from the `TRANSACTION_AMT` column, and then plot it (again, could be a bar plot, line plot, dot plot, etc., anything is OK).
-
-. Does it look like it follows Benford's law? Why or why not?
-
-.Helpful Hint
-[%collapsible]
-====
-Use this to help plot:
-[source,r]
-----
-firstdigit <- as.numeric(firstdigit)
-hist(firstdigit)
-----
-====
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== THREE
-
-++++
-
-++++
-
-Create a function that will look at both the `EMPLOYER` and the `OCCUPATION` columns and return a new data frame with an added column named `Employed` that is FALSE if `EMPLOYER` is "NOT EMPLOYED",
-and is FALSE if `OCCUPATION` is "NOT EMPLOYED",
-and is TRUE otherwise.
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FOUR
-
-How many arguments does the above function have?
-What does each line do? Use `#` comments to explain your function.
-
-Using a graph, can you show the percentage of individuals employed vs not employed?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FIVE
-
-++++
-
-++++
-
-Write your own custom function! Make sure your function has at least two arguments and get creative. Your function could output a plot, or search and find information within the data.frame. Use what you have learned in Project 8 and 9 to help guide you.
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-
-.Resources
-[%collapsible]
-====
-* https://towardsdatascience.com/what-is-benfords-law-and-why-is-it-important-for-data-science-312cb8b61048["What is Benford's Law and Why is it Important for Data Science"]
-
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project10.adoc
deleted file mode 100644
index 3c08f007f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project10.adoc
+++ /dev/null
@@ -1,204 +0,0 @@
-= TDM 10100: Project 10 -- 2022
-Creating functions and using tapply and sapply
-
-**Motivation:** As we have learned, functions are foundational to more complex programs and behaviors. +
-There is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming].
-
-**Context:**
-We will apply functions to entire vectors of data using `sapply`. We learned how to create functions, and the next step is to apply them across a series of data. `sapply` is one of the best ways to do this in `R`.
-
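-For instance, here is a minimal sketch of `sapply` applying a function to every element of a vector (toy data, not the OKCupid files):
-[source,r]
-----
-square_it <- function(x) {
-  x^2
-}
-
-# apply square_it to each element of the vector
-sapply(c(1, 2, 3, 4), square_it)
-----
-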
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-* /anvil/projects/tdm/data/okcupid/filtered/users.csv
-* /anvil/projects/tdm/data/okcupid/filtered/questions.csv
-
-.Helpful Hint
-[%collapsible]
-====
-The `read.csv()` function automatically uses a comma `,` as the delimiter. +
-You can use other delimiters by adding the `sep` argument, +
-e.g., `read.csv(..., sep=';')` +
-
-Use the `readLines(..., n=x)` function to see the first `x` rows of the file, in order to identify which character you should pass to the `sep` argument.
-====
-
-
-== Questions
-
-=== ONE
-
-++++
-
-++++
-
-We want to go ahead and load the datasets into data.frames named `users` and `questions`. Take a look at both data.frames and identify what is part of each of them. What information is in each dataset, and how are they related?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1 or 2 sentences on the datasets.
-====
-
-=== TWO
-
-++++
-
-++++
-
-Simply put, `grep` helps us to find a word within a string. In `R`, `grep` is vectorized and can be applied to an entire vector of strings. We will use it to find any questions that mention `google` in the data.frame `questions`.
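-
-As a small illustration on a toy vector (made-up strings, not the real questions):
-[source,r]
-----
-sentences <- c("I like apples", "Bananas are good", "apples and oranges")
-
-# indices of the elements that contain "apples"
-grep("apples", sentences)
-
-# the matching elements themselves
-grep("apples", sentences, value = TRUE)
-----
-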
-[loweralpha]
-.. What do you notice if you just use the function `grep()` and create a new variable google and then print that variable?
-
-.. Now that you know the row number, how can you take a look at the information there?
-
-(Bonus question: can you find a shortcut to steps a & b?)
-
-.Helpful Hint
-[%collapsible]
-====
-https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep[*grep*] - `grep()` is a function in `R` that is used to search for matches of a pattern within each element of a string.
-[source,r]
-----
-grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
- fixed = FALSE, useBytes = FALSE, invert = FALSE)
-
-grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
- fixed = FALSE, useBytes = FALSE)
-----
-====
-
-.Insider Information
-[%collapsible]
-====
-Just an FYI refresh: +
-
-* `<-` is an assignment operator, it assigns values to a variable
-
-* Functions *must* be called using the round brackets, aka parentheses *`()`*
-
-* Square brackets *`[]`*, are also called `extraction operators` as they are used to help extract specific elements from a vector or matrix.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== THREE
-
-++++
-
-++++
-
-[loweralpha]
-.. Using the row from our previous question, which variable does this correspond with in the data.frame `users`?
-
-.. Knowing that the two possible answers are "No. Why spoil the mystery?" and "Yes, Knowledge is power!", what percentage of users do *NOT* google someone before the first date?
-
-
-.Helpful Hint
-[%collapsible]
-====
-* Row 2172 in `questions` corresponds to column named `q170849` in `users`
-
-* The `table()` function can be used to quickly create frequency tables
-
-* The `prop.table()` function can calculate the value of each cell in a table as a proportion of all values.
-====
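-
-A minimal sketch of `table` and `prop.table` on a toy vector:
-[source,r]
-----
-answers <- c("yes", "no", "yes", "yes", "no")
-
-counts <- table(answers)
-counts
-
-# convert the counts into proportions of the total
-prop.table(counts)
-----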
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-Using the ability to create a function *AND* `tapply`, find the percentages of Female vs Male (Man vs Woman, as categorized in the users data.frame) who *DO* google someone before their date.
-
-
-
-.Helpful Hint
-[%collapsible]
-====
-* https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/tapply[`tapply()`] function can be used to apply some function to a vector that has been grouped by another vector.
-`tapply(x, INDEX, FUNCTION)`
-====
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FIVE
-
-++++
-
-++++
-
-Using the ability to create a function *AND* using `sapply()`, write a function that takes a string and removes everything after (and including) the `_` from the `gender_orientation` column in the `users` data.frame. Or it is OK to solve this question as given in the video, without a function and without `sapply()`.
-
-In other words, `Hetero_male` -> `Hetero`; we want to do this for the entire `gender_orientation` column.
-
-
-
-.Insider Information
-[%collapsible]
-====
-`sapply()` allows you to iterate over a list or vector _without_ the need to use a for loop, which is typically a slow way to work in `R`.
-
-Remember the difference +
-(a `very` brief summary of each)
-
-* A vector is the basic data structure in `R`. Vectors come in two flavors, atomic vectors and lists, and have three common properties:
-** Type - `typeof()`
-** Length - `length()`
-** Attributes - `attributes()`
-They differ in the types of elements they hold. All elements of an atomic vector must be the same type (atomic vectors are also always "flat"), but the elements of a list can be of different types.
-Lists are constructed with the function `list()`; atomic vectors are constructed with the function `c()`.
-You can determine the specific type with functions like *is.character(), is.double(), is.integer(), is.logical()*.
-
-* A matrix is two-dimensional (rows and columns), and all cells must be the same type. It can be created with the function `matrix()`.
-
-* An array can be one-dimensional or multi-dimensional. An array with one dimension is similar (but not identical) to a vector. An array with two dimensions is similar (but not identical) to a matrix. An array with three or more dimensions is an n-dimensional array. Arrays can be created with the function `array()`.
-
-* A data frame is like a table, or like a matrix, *BUT* the columns can hold different types of data.
-====
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-
-.*Resources*
-[%collapsible]
-====
-* https://www.geeksforgeeks.org/find-position-of-a-matched-pattern-in-a-string-in-r-programming-grep-function/
-
-====
-
-
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project11.adoc
deleted file mode 100644
index e69de29bb..000000000
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project12.adoc
deleted file mode 100644
index 5475dfb4b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project12.adoc
+++ /dev/null
@@ -1,171 +0,0 @@
-= TDM 10100: Project 12 -- 2022
-Tidyverse and Lubridate
-
-**Motivation:**
-In the previous project we manipulated dates; in this project we are going to take it a bit further and use the Tidyverse, more specifically the lubridate package.
-Working with dates in `R` can require more attention than working with other object classes. These packages will help simplify some of the common tasks related to date data. +
-
-Dates and times can be complicated: not every year has 365 days, not every day has 24 hours, and not every minute has 60 seconds. Dates are difficult because they have to accommodate the Earth's rotation and orbit around the sun, as well as time zones, daylight saving time, etc.
-Suffice it to say that when working with dates and date-times in R, the simpler the better. Lubridate helps us do so.
-
-.Learning Objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Incorporate legends using legend().
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-- Convert strings to dates, and format dates using the lubridate package.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- /anvil/projects/tdm/data/zillow/State_time_series.csv
-
-== Questions
-First, let's import the libraries:
-
-* data.table
-* lubridate
-
-[source,r]
-----
-library(data.table) # make sure to load data.table first
-library(lubridate) # and then to load lubridate second; it will give you a warning in pink color but it is totally OK
-# You need to load `data.table` first and `lubridate` second for this project, because they both define `wday` and we want the version from `lubridate` so we need to load it second!
-----
-We are going to continue to dig into the Zillow time series data.
-
-=== ONE
-
-++++
-
-++++
-
-[loweralpha]
-. Go ahead and read in the dataset as `states`
-. Find the class and the type of the column named `Date`
-. Are there multiple functions that will return the same or similar information?
-
-
-.Insider Knowledge
-[%collapsible]
-====
-Reminder: +
-- `class` shows the class of the specified object used as the arguments. The most common ones include but are not limited to: "numeric", "character", "logical", "date". +
-- `typeof` shows you the type or storage mode of objects. The most common ones include but are not limited to: "logical", "integer", "double", "complex", "character", "raw" and "list"
-====
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== TWO
-
-++++
-
-++++
-
-[loweralpha]
-. In Project 11, we had to convert the `Date` column to a month, day, year format. Now convert the column `Date` into values from the class *Date*. (You can use lubridate to do so.) What do you think about the methods you have learned (so far) to convert dates?
-. Create a new column in your data.frame `states` named `day_of_the_week` that shows the day of the week (Sunday-Saturday).
-. Let's create another column in the data.frame `states` that shows the days of the week as numbers.
-
-
-[source,r]
-----
-states$Date <- as.Date(states$Date, format="%Y-%m-%d")
-----
-
-
-.Helpful Hint
-[%collapsible]
-====
-Take a look at the functions `ymd`, `mdy`, `dym`
-====
-
-.Helpful Hint
-[%collapsible]
-====
-- Take a look at the functions `month`, `year`, `day`, `wday`.
-- The *label* argument is logical, and is only available for the `wday()` function. TRUE displays the day of the week as an ordered factor of character strings, such as "Sunday"; FALSE displays the day of the week as a number.
-- The *week_start* argument controls which day starts the week; by default 1 means Monday and 7 means Sunday. When label = TRUE, the start day will be the first level of the returned factor. You can set the lubridate.week.start option to control this parameter.
-====
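-
-For example (a toy date, not from the Zillow data):
-[source,r]
-----
-d <- ymd("2017-06-15")
-
-wday(d, label = TRUE)   # day of the week as a labeled factor
-wday(d, label = FALSE)  # day of the week as a number
-month(d, label = TRUE)  # month as a labeled factor
-----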
-
-.Insider Knowledge
-[%collapsible]
-====
-Default values of class *Date* in `R` is displayed as YYYY-MM-DD
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== THREE
-
-++++
-
-++++
-
-We want to see whether there are better months for putting a house on the market.
-[loweralpha]
-. Use `tapply` to compare the average `DaysOnZillow_AllHomes` for all months.
-. Make a barplot showing our results.
-
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-Find the information only for the year 2017 and call it `states2017`. Then create a lineplot that shows the average `DaysOnZillow_AllHomes` by `Date` using the `states2017` data. What do you notice? When was the best month/months for posting a home for sale in 2017?
-
-=== FIVE
-
-++++
-
-++++
-
-Now we want to know: do homes sell faster in some states than in others? Let's look at Indiana, Maine, and Hawaii. Create a lineplot of `DaysOnZillow_AllHomes` by `Date` with one line per state. Use the `states2017` dataset for this question. Make sure each state's line is colored differently, and add a legend to identify which is which.
-
-.Helpful Hint
-[%collapsible]
-====
-Use the `lines()` function to add lines to your plot +
-Use the `ylim` argument to show all lines +
-Use the `col` argument to identify and alter colors.
-====
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project13.adoc
deleted file mode 100644
index 23f0cdd27..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project13.adoc
+++ /dev/null
@@ -1,82 +0,0 @@
-= TDM 10100: Project 13 -- 2022
-
-**Motivation:** This semester we took a deep dive into `R` and its packages. Let's take a second to pat ourselves on the back for surviving a long semester, and review what we have learned!
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- /anvil/projects/tdm/data/beer/beers.csv
-
-== Questions
-
-=== ONE
-
-++++
-
-++++
-
-Read in the dataset into a data.frame called `beer`.
-[loweralpha]
-. What is the file size, and how many rows, columns, and types of data are there?
-. What is the average score for a `stout`? (Consider a stout to be any beer whose `name` contains the word `stout`.)
-. How many pale ales are on this list? (Consider a pale ale to be any beer whose `name` contains the words `pale` and `ale`.)
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== TWO
-
-++++
-
-++++
-
-. Plot or graph all the beers that are available in the summer, along with their ratings.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== THREE
-
-++++
-
-++++
-
-. Create a plot of the average rating of beer by country.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== FOUR
-
-++++
-
-++++
-
-++++
-
-++++
-
-. Do `limited` runs of beer have a greater median rating than all others?
-(Consider limited to be any beer that has the word `Limited` in the `availability` column.)
-
-. Use the `unique` function to investigate the `availability` column. Why are there different labels that are technically the same?
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-projects.adoc
deleted file mode 100644
index 104f7f661..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-projects.adoc
+++ /dev/null
@@ -1,41 +0,0 @@
-= TDM 10100
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-[%header,format=csv,stripes=even,%autowidth.stretch]
-|===
-include::ROOT:example$10100-2022-projects.csv[]
-|===
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2022/tdm10100[https://piazza.com/purdue/fall2022/tdm10100]
-
-=== Link
-
-https://piazza.com/purdue/fall2022/tdm10100/home[https://piazza.com/purdue/fall2022/tdm10100/home]
-
-== Syllabus
-
-See xref:fall2022/logistics/syllabus.adoc[here].
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project01.adoc
deleted file mode 100644
index c677b0ee6..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project01.adoc
+++ /dev/null
@@ -1,282 +0,0 @@
-= TDM 20100: Project 1 -- 2022
-
-**Motivation:** It’s been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like `sort`, `awk`, `grep`, and `sed`. While Python and R are extremely powerful tools that can solve many problems — they aren’t always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using `sqlite3` and `MySQL/MariaDB`.
-
-We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review a bit of R and Python, and prepare for the rest of the semester.
-
-**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester.
-
-**Scope:** Jupyter Lab, R, Python, Anvil, markdown
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Anvil.
-- Review R and Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/1991.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster].
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters".
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer.
-
-[NOTE]
-====
-Last year, we used the https://www.rcac.purdue.edu/compute/brown[Brown computing cluster]. Compare the specs of https://www.rcac.purdue.edu/compute/anvil[Anvil] and https://www.rcac.purdue.edu/compute/brown[Brown] -- which one is more powerful?
-====
-
-.Items to submit
-====
-- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil.
-- A sentence explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-=== Question 2
-
-Like the previous year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward.
-
-[TIP]
-====
-If you did not (yet) set up your 2-factor authentication credentials with Duo, you can go back to Step 9 and set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup
-====
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 MB of memory.
-
-[NOTE]
-====
-It is OK to not understand what that means yet; we will learn more about this in TDM 30100. For the curious, however, if you were to open a terminal session in Anvil and run the following, you would see your job queued up.
-
-[source,bash]
-----
-squeue -u username # replace 'username' with your username
-----
-====
-
-[NOTE]
-====
-If you select 4000 MB of memory instead of 3800 MB, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 MB per CPU core.
-====
-
-We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine.
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2022-s2023::
-The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-f2022-s2023-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you.
-
-[NOTE]
-====
-Soon, we'll have the f2022-s2023-r kernel available and ready to use!
-====
-
-Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node on Anvil that you are running on).
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know!
-
-Practice running the following examples.
-
-python::
-[source,python]
-----
-my_list = [1, 2, 3]
-print(f'My list is: {my_list}')
-----
-
-SQL::
-[source, sql]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-[source, ipython]
-----
-%%sql
-
-SELECT * FROM titles LIMIT 5;
-----
-
-[NOTE]
-====
-In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements!
-
-[source,ipython]
-----
-%load_ext sql
-----
-====
-
-bash::
-[source,bash]
-----
-%%bash
-
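-# sums the 19th comma-separated field (the flight distance, in miles) over every row,
-# then prints the total distance in miles and in kilometers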
-awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv
-----
-
-[TIP]
-====
-To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`).
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code that you can run), and markdown cells (which contain markdown text that you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many cells of each type are there in the default template?
-====
-
-=== Question 5
-
-Markdown is well worth learning about. You may already be a Markdown expert, however, more practice never hurts.
-
-Create a Markdown cell in your notebook.
-
-Create both an _ordered_ and _unordered_ list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another _ordered_ list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, **embolden** at least 1 item in your list, _italicize_ at least 1 item in your list, and make at least 1 item in your list formatted like `code`.
-
-[TIP]
-====
-You can quickly get started with Markdown using this cheat sheet: https://www.markdownguide.org/cheat-sheet/
-====
-
-[TIP]
-====
-Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum):
-
-- A header for this section (your choice of size) that says "About".
-+
-[TIP]
-====
-A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`.
-====
-+
-- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn.
-- In the about section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-++++
-
-++++
-
-Review your Python and R skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine) for each language.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project02.adoc
deleted file mode 100644
index 570e44590..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project02.adoc
+++ /dev/null
@@ -1,312 +0,0 @@
-= TDM 20100: Project 2 -- 2022
-
-**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook.
-
-**Context:** At this point in time, our Jupyter Lab system, using https://ondemand.anvil.rcac.purdue.edu, is new to some of you, and maybe familiar to others. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab.
-
-**Scope:** bash, Jupyter Lab
-
-.Learning Objectives
-****
-- Distinguish differences in `/home`, `/anvil/scratch`, and `/anvil/projects/tdm`.
-- Navigating UNIX via a terminal: `ls`, `pwd`, `cd`, `.`, `..`, `~`, etc.
-- Analyzing files in a UNIX filesystem: `wc`, `du`, `cat`, `head`, `tail`, etc.
-- Creating and destroying files and folders in UNIX: `scp`, `rm`, `touch`, `cp`, `mv`, `mkdir`, `rmdir`, etc.
-- Use `man` to read and learn about UNIX utilities.
-- Run `bash` commands from within Jupyter Lab.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data`
-
-== Questions
-
-[IMPORTANT]
-====
-If you are not a `bash` user and you use an alternative shell like `zsh` or `tcsh`, you will want to switch to `bash` for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the `%%bash` magic will use `/bin/bash` rather than your default shell, so you will not need to do anything.
-====
-
-[NOTE]
-====
-While it is not _super_ common for us to push a lot of external reading at you (other than the occasional blog post or article), https://learning.oreilly.com/library/view/learning-the-unix/0596002610[this] is an excellent, and _very_ short resource to get you started using a UNIX-like system. We strongly recommend reading chapters 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8.
-====
-
-=== Question 1
-
-Let's ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from https://ondemand.anvil.rcac.purdue.edu.
-
-Open your settings by navigating to menu:Settings[Advanced Settings Editor].
-
-Explore the settings, make at least 2 modifications to your environment, and list what you've changed.
-
-Here are some settings Kevin likes:
-
-- menu:Theme[Selected Theme > JupyterLab Dark]
-- menu:Document Manager[Autosave Interval > 30]
-- menu:File Browser[Show hidden files > true]
-- menu:Notebook[Line Wrap > on]
-- menu:Notebook[Show Line Numbers > true]
-- menu:Notebook[Shut down kernel > true]
-
-Dr. Ward does not like to customize his own environment, but he _does_ use the Emacs key bindings.
-
-- menu:Settings[Text Editor Key Map > emacs]
-
-[IMPORTANT]
-====
-Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc.
-====
-
-.Items to submit
-====
-- List (using a markdown cell) of the modifications you made to your environment.
-====
-
-=== Question 2
-
-In the previous project, we used a tool called `awk` to parse through a dataset. This was an example of running bash code using the `f2022-s2023` kernel. Aside from using the `%%bash` magic from the previous project, there are 2 more straightforward ways to run bash code from within Jupyter Lab.
-
-The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example.
-
-[source,ipython]
-----
-!ls
-
-import pandas as pd
-myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
-myDF.head()
-----
-
-[NOTE]
-====
-This does _not_ require you to have other, Python code in the cell. The following is perfectly valid.
-
-[source,ipython]
-----
-!ls
-!ls -la /anvil/projects/tdm/
-----
-
-With that being said, using this method, each line _must_ start with an exclamation point.
-====
-
-The second method is to open up a new terminal session. To do this, go to menu:File[New > Terminal]. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, `man`.
-
-[source,bash]
-----
-# man is short for manual, to quit, press "q"
-# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down.
-man man
-----
-
-Great! Now that you've learned 2 new ways to run `bash` code from within Jupyter Lab, please answer the following question. What is the _absolute path_ of the default directory of your `bash` shell? When we say "default directory" we mean the folder that you are "in" when you first run `bash` code in a Jupyter cell or when you first open a Terminal. This is also referred to as the home directory.
-
-**Relevant topics:** https://the-examples-book.com/starter-guides/unix/pwd[pwd]
-
-.Items to submit
-====
-- The full filepath of the default directory (home directory). Ex: Kevin's is: `/home/x-kamstut` and Dr Ward's is: `/home/x-mdw`.
-- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched.
-====
-
-=== Question 3
-
-It is critical to be able to navigate a UNIX-like operating system. It is likely that you will need to use UNIX or Linux (or a similar system) at some point in your career. Perform the following actions, in order, using the `bash` shell.
-
-[WARNING]
-====
-For the sake of consistency, please run your `bash` code using the `%%bash` magic. This ensures that we are all using the correct shell (there are many shells), and that your work is displayed properly for your grader.
-====
-
-. Write a command to navigate to the directory containing the datasets used in this course: `/anvil/projects/tdm/data`.
-. Print the current working directory. Is the result what you expected? Output the `$PWD` variable, using the `echo` command.
-. List the files within the current working directory (excluding subfiles).
-. Without navigating out of `/anvil/projects/tdm/data`, list _all_ of the files within the `movies_and_tv` directory, _including_ hidden files.
-. Return to your home directory.
-. Write a command to confirm that you are back in the appropriate directory.
-
-[NOTE]
-====
-`/` is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/x-kamstut` is the _absolute path_ of Kevin's home directory. There is a folder called `home` inside the root `/` directory. Inside `home` is another folder named `x-kamstut`, which is Kevin's home directory.
-====
-
-**Relevant topics:** xref:starter-guides:data-science:unix:pwd.adoc[pwd], xref:starter-guides:data-science:unix:cd.adoc[cd], xref:starter-guides:data-science:unix:echo.adoc[echo], xref:starter-guides:data-science:unix:ls.adoc[ls]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-When running the `ls` command (specifically the `ls` command that showed hidden files and folders), you may have noticed two oddities that appeared in the output: "." and "..". `.` represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the `/anvil/projects/tdm/data` directory, the `.` refers to the `/anvil/projects/tdm/data` directory. If you are running the following bash command, the `.` is redundant and refers to the `/anvil/projects/tdm/data/yelp` directory.
-
-[source,bash]
-----
-ls -la /anvil/projects/tdm/data/yelp/.
-----
-
-`..` represents the parent directory, relative to the rest of the path. For example, if you are in the `/anvil/projects/tdm/data` directory, the `..` refers to the parent directory, `/anvil/projects/tdm`.
-
-Any path that contains either `.` or `..` is called a _relative path_ (because it is _relative_ to the directory you are currently in). Any path that contains the entire path, starting from the root directory, `/`, is called an _absolute path_.
-
-. Write a single command to navigate to our modulefiles directory: `/anvil/projects/tdm/opt/lmod`.
-. Confirm that you are in the correct directory using the `echo` command.
-. Write a single command to navigate back to your home directory, however, rather than using `cd`, `cd ~`, or `cd $HOME` without the path argument, use `cd` and a _relative_ path.
-. Confirm that you are in the correct directory using the `echo` command.
-
-[NOTE]
-====
-If you don't fully understand the text above, _please_ take the time to understand it. It will be incredibly helpful to you, not only in this class, but in your career.
-====
-
-**Relevant topics:** xref:starter-guides:data-science:unix:pwd.adoc[pwd], xref:starter-guides:data-science:unix:cd.adoc[cd], xref:starter-guides:data-science:unix:special-symbols.adoc[. & .. & ~]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Your `$HOME` directory is your default directory. You can navigate to your `$HOME` directory using any of the following commands.
-
-[source,bash]
-----
-cd
-cd ~
-cd $HOME
-cd /home/$USER
-----
-
-This is typically where you will work, and where you will store your work (for instance, your completed projects).
-
-[NOTE]
-====
-`$HOME` and `$USER` are environment variables. You can see what they are by typing `echo $HOME` and `echo $USER`. Environment variables are variables that are set by the system, or by the user. To get a list of your terminal session's environment variables, type `env`.
-====
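-
-[TIP]
-====
-For example, a quick sanity check (nothing here is specific to this project) that prints a couple of these environment variables, and then peeks at a few more:
-
-[source,bash]
-----
-# print two common environment variables, then the first few entries of env
-echo $HOME
-echo $USER
-env | head -n 5
-----
-====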
-
-The `/anvil/projects/tdm` space is a directory created for The Data Mine. It holds our datasets (in the `data` directory), as well as data for many of our corporate partners' projects.
-
-There exists 1 more important location on each cluster, `scratch`. Your `scratch` directory is located at `/anvil/scratch/$USER`, or, even shorter, `$SCRATCH`. `scratch` is meant for use with _really_ large chunks of data. The quota on Anvil is currently 100TB and 1 million files. You can see your quota and usage on Anvil by running the following command.
-
-[source,bash]
-----
-myquota
-----
-
-[TIP]
-====
-`$SCRATCH` and `$USER` are environment variables. You can see what they are by typing `echo $SCRATCH` and `echo $USER`. `$SCRATCH` contains the absolute path to your scratch directory, and `$USER` contains the username of the current user.
-====
-
-In a `bash` cell, please perform the following operations.
-
-. Navigate to your `scratch` directory.
-. Confirm that you are in the correct location using a command.
-. Execute the `/anvil/projects/tdm/bin/tokei` command, with input `/home/x-kamstut/bin`.
-+
-[NOTE]
-====
-Doug Crabill is the compute wizard for the Statistics department here at Purdue. `~dgc/bin` is a directory (on a different cluster) he has made publicly available with a variety of useful scripts. I've copied over those files to `~x-kamstut/bin`.
-====
-+
-. Output the first 5 lines and last 5 lines of `~x-kamstut/bin/union`.
-. Count the number of lines in the bash script `~x-kamstut/bin/union` (using a UNIX command).
-. How many bytes is the script?
-+
-[CAUTION]
-====
-Be careful. We want the size of the script, not the disk usage.
-====
-+
-. Find the location of the `python3` command.
-
-[TIP]
-====
-Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages.
-
-[source,bash]
-----
-man wc
-----
-
-You can see -m, -l, and -w are all options for `wc`. Then, to test the options out, you can try the following examples.
-
-[source,bash]
-----
-# using the default wc command. "/anvil/projects/tdm/data/flights/1987.csv" is the first "argument" given to the command.
-wc /anvil/projects/tdm/data/flights/1987.csv
-
-# to count the lines, use the -l option
-wc -l /anvil/projects/tdm/data/flights/1987.csv
-
-# to count the words, use the -w option
-wc -w /anvil/projects/tdm/data/flights/1987.csv
-
-# you can combine options as well
-wc -w -l /anvil/projects/tdm/data/flights/1987.csv
-
-# some people like to use a single tack `-`
-wc -wl /anvil/projects/tdm/data/flights/1987.csv
-
-# order doesn't matter
-wc -lw /anvil/projects/tdm/data/flights/1987.csv
-----
-====
-
-**Relevant topics:** xref:starter-guides:data-science:unix:pwd.adoc[pwd], xref:starter-guides:data-science:unix:cd.adoc[cd], xref:starter-guides:data-science:unix:head.adoc[head], xref:starter-guides:data-science:unix:tail.adoc[tail], xref:starter-guides:data-science:unix:wc.adoc[wc], xref:starter-guides:data-science:unix:du.adoc[du], xref:starter-guides:data-science:unix:which.adoc[which], xref:starter-guides:data-science:unix:type.adoc[type]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-Perform the following operations.
-
-. Navigate to your scratch directory.
-. Copy the following file to your current working directory: `/anvil/projects/tdm/data/movies_and_tv/imdb.db`.
-. Create a new directory called `movies_and_tv` in your current working directory.
-. Move the file, `imdb.db`, from your scratch directory to the newly created `movies_and_tv` directory (inside of scratch).
-. Use `touch` to create a new, empty file called `im_empty.txt` in your scratch directory.
-. Remove the directory, `movies_and_tv`, from your scratch directory, including _all_ of the contents.
-. Remove the file, `im_empty.txt`, from your scratch directory.
-
-**Relevant topics:** xref:starter-guides:data-science:unix:cp.adoc[cp], xref:starter-guides:data-science:unix:rm.adoc[rm], xref:starter-guides:data-science:unix:touch.adoc[touch], xref:starter-guides:data-science:unix:cd.adoc[cd]
-
-=== Question 7
-
-[IMPORTANT]
-====
-This question should be performed by opening a terminal window. menu:File[New > Terminal]. Enter the result/content in a markdown cell in your notebook.
-====
-
-Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a _really_ useful feature, that you may not know is there unless you are told!
-
-Here is the way it works, in the most common case -- using `cd`. Have a destination in mind, for example `/anvil/projects/tdm/data/flights/`. Type `cd /anvil/`, and press tab. You should be presented with a small list of options -- the folders in the `anvil` directory. Type `p`, then press tab, and it will complete the word for you. Type `t`, then press tab. Finally, press tab, but this time, press tab repeatedly until you've selected `data`. You can then continue to type and press tab as needed.
-
-Below is an image of the absolute path of a file in Anvil. Use `cat` and tab completion to print the contents of that file.
-
-image::figure03.webp[Tab completion, width=792, height=250, loading=lazy, title="Tab completion"]
-
-.Items to submit
-====
-- The content of the file, `hello_there.txt`, in a markdown cell in your notebook.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project03.adoc
deleted file mode 100644
index c0ee7b8dc..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project03.adoc
+++ /dev/null
@@ -1,202 +0,0 @@
-= TDM 20100: Project 3 -- 2022
-
-**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process -- after all, projects in industry will not typically provide you with a path to your dataset and call it a day. `grep` is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated -- https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is incredibly useful and will come in handy regardless of the language you are working in.
-
-[NOTE]
-====
-Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less-interested in UNIX tools (which you shouldn't be, they can be awesome), you should definitely take the time to learn regular expressions.
-====
-
-**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python.
-
-**Scope:** `grep`, regular expression basics, utilizing regular expression tools in R and Python
-
-.Learning Objectives
-****
-- Use `grep` to search for patterns within a dataset.
-- Use `cut` to section off and slice up data from the command line.
-- Use `wc` to count the number of lines of input.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/consumer_complaints/complaints.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data.
-
-Let's assume for a second that we _didn't_ provide you with the location of this project's dataset, and you didn't know the name of the file either. With all of that being said, you _do_ know that it is the only dataset with the text "That's the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it.
-
-[TIP]
-====
-When you search for this sentence in the file, make sure that you type the single quote in "That's" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence. Or, just use a unique _part_ of the sentence that will likely not exist in another file.
-====
-
-Write a `grep` command that finds the dataset. You can start in the `/anvil/projects/tdm/data` directory to reduce the amount of text being searched. In addition, use a wildcard to reduce the directories we search to only directories that start with a `con` inside the `/anvil/projects/tdm/data` directory. Just know that you'd _eventually_ find the file without using the wildcard, but we don't want to waste your time.
-
-[TIP]
-====
-Use `man` to read about some of the options with `grep`. For example, you'll want to search _recursively_ through the entire contents of the directories starting with a `con`.
-====
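-
-[TIP]
-====
-As a generic sketch of the idea (the search phrase below is just a placeholder, not the one from the question), a recursive, filename-only search combined with a shell wildcard might look like the following.
-
-[source,bash]
-----
-# -r searches directories recursively, -l prints only the names of matching files,
-# and con* expands to every directory under data/ whose name starts with "con"
-grep -rl 'some unique phrase' /anvil/projects/tdm/data/con*
-----
-====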
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-In the previous project, you learned about a command that could quickly print out the first _n_ lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and _only_ the first line of the file.
-
-Great, now that you know what each column holds, repeat question (1), but format the output so that it shows the `complaint_id`, `consumer_complaint_narrative`, and the `state`. Print only the first 100 lines (using `head`) so our notebook is not too full of text.
-
-Now, use `cat`, `head`, `tail`, and `cut` to isolate those same 3 columns for the _single_ line where we heard about the "fraudy fraudulent fraud".
-
-[TIP]
-====
-You can find the exact line of the file where the "fraudy fraudulent fraud" occurs by using the `-n` option of `grep`. That will tell you the line number, which can then be used with `head` and `tail` to isolate the single line.
-====
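-
-[TIP]
-====
-As a generic sketch of the `head`/`tail` trick (the line number and file name below are hypothetical), if `grep -n` reported a match on line 1234, the following would isolate just that line.
-
-[source,bash]
-----
-# keep the first 1234 lines, then keep only the last line of that output
-head -n 1234 somefile.csv | tail -n 1
-----
-====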
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Imagine a scenario where we are dealing with a _much_ bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in the `consumer_complaint_narrative`, `state`, `tags`, and `complaint_id` columns.
-
-Use UNIX tools to, in one line, create a _new_ dataset called `southeast.csv` that only contains the data for the five states mentioned above, and only the columns listed above.
-
-[TIP]
-====
-Be careful you don't accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL").
-====
-
-How many rows of data remain? How many megabytes is the new file? Use `cut` to isolate _just_ the data we ask for. For example, _just_ print the number of rows, and _just_ print the value (in Mb) of the size of the file.
-
-.this
-----
-20M
-----
-
-.not this
-----
--rw-r--r-- 1 x-kamstut x-tdm-admin 20M Dec 13 10:59 /home/x-kamstut/southeast.csv
-----
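-
-[TIP]
-====
-One possible way to isolate just those values (the file name below is whatever you called your new dataset; `--apparent-size` asks GNU `du` for the file size rather than the disk usage):
-
-[source,bash]
-----
-# number of rows only
-wc -l < southeast.csv
-
-# file size only, human-readable (du output is tab-separated: size, then name)
-du -h --apparent-size southeast.csv | cut -f1
-----
-====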
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-We want to isolate some of our southeast complaints. Return rows from our new dataset, `southeast.csv`, that have one of the following words: "wow", "irritating", or "rude" followed by at least 1 exclamation mark. Do this with just a single `grep` command. Ignore case (whether or not parts of the words "wow", "rude", or "irritating" are capitalized). Limit your output to only 5 rows (using `head`).
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-++++
-
-++++
-
-If you pay attention to the `consumer_complaint_narrative` column in our new dataset, `southeast.csv`, you'll notice that some of the narratives contain dollar amounts in curly braces `{` and `}`. Use `grep` to find the narratives that contain at least one dollar amount enclosed in curly braces. Use `head` to limit output to only the first 5 results.
-
-[TIP]
-====
-Use the option `-E` to use extended regular expressions. This will make your regular expressions less messy (less escaping).
-====
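-
-[TIP]
-====
-A tiny, unrelated example of `-E` in action (the text being searched is made up): with extended regular expressions, quantifiers like `+` work without extra escaping, while a literal `$` still needs to be escaped.
-
-[source,bash]
-----
-# matches because the string contains a literal '$' followed by one or more digits
-echo 'the charge was $15 today!' | grep -E '\$[0-9]+'
-----
-====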
-
-[NOTE]
-====
-There are instances like `{>= $1000000}` and `{ XXXX }`. The first example qualifies, but the second doesn't. Make sure the following are matched:
-
-- {$0.00}
-- { $1,000.00 }
-- {>= $1000000}
-- { >= $1000000 }
-
-And that the following are _not_ matched:
-
-- { XXX }
-- {XXX}
-====
-
-[TIP]
-====
-Regex is hard. Try the following logic.
-
-. Match a "{"
-. Match 0 or more of any character that isn't a-z, A-Z, or 0-9
-. Match 1 or more "$"
-. Match 1 or more of any character that isn't "}"
-. Match "}"
-====
-
-[TIP]
-====
-To verify your answer, the following code should have the following result.
-
-[source,bash]
-----
-grep -E 'regexhere' $HOME/southeast.csv | head -n 5 | cut -d, -f4
-----
-
-.result
-----
-3185125
-3184467
-3183547
-3183544
-3182879
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project04.adoc
deleted file mode 100644
index 95d488dae..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project04.adoc
+++ /dev/null
@@ -1,140 +0,0 @@
-= TDM 20100: Project 4 -- 2022
-
-**Motivation:** Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs.
-
-**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
-
-**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping
-
-.Learning Objectives
-****
-- Use `cut` to section off and slice up data from the command line.
-- Use piping to string UNIX commands together.
-- Use `sort` and its options to sort data in different ways.
-- Use `head` to isolate n lines of output.
-- Use `wc` to summarize the number of lines in a file or in output.
-- Use `uniq` to filter out duplicate (adjacent) lines.
-- Use `grep` to search files effectively.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/stackoverflow/unprocessed/*`
-- `/anvil/projects/tdm/data/stackoverflow/processed/*`
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-[WARNING]
-====
-For this project, please submit a `.sh` text file with all of your `bash` code written inside of it. This should be submitted _in addition to_ your notebook (the `.ipynb` file). Failing to submit the accompanying `.sh` file may result in points being removed from your final submission. Thanks!
-====
-
-=== Question 1
-
-++++
-
-++++
-
-In a csv file, there are n+1 columns where n is the number of commas (in theory). Take the first line of `unprocessed/2011.csv`, replace all commas with the newline character, `\n`, and use `wc` to count the resulting number of lines. This should approximate how many columns are in the dataset. What is the value?
-
-This can't be right, can it? Print the first 100 lines after using `tr` to replace commas with newlines. What do you notice?
-
-[TIP]
-====
-The newline character in UNIX is `\n`.
-====
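-
-[TIP]
-====
-A toy example of the idea (the string is made up): `tr` turns each comma into a newline, so `wc -l` ends up counting fields.
-
-[source,bash]
-----
-# prints 4 -- one line per comma-separated field
-echo 'a,b,c,d' | tr ',' '\n' | wc -l
-----
-====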
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in `processed/2011.csv`. How many columns are there?
-
-Take a look at `iowa_liquor_sales_cleaner.txt` -- how many columns does that file have?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-Continuing to look at the `iowa_liquor_sales_cleaner.txt` dataset, what are the 5 largest orders by number of bottles sold? How about by gallons sold?
-
-[TIP]
-====
-`cat`, `cut`, `sort`, and `head` will be useful.
-====
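-
-[TIP]
-====
-A generic sketch of the pipeline (the field number and delimiter here are placeholders -- check the header of the file first): pull out one numeric column, sort it numerically in reverse, and keep the top 5 values.
-
-[source,bash]
-----
-# -d';' sets the delimiter, -n sorts numerically, -r reverses (largest first)
-cut -d';' -f10 somefile.txt | sort -n -r | head -n 5
-----
-====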
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-What are the different sizes (in ml) that a bottle of liquor comes in?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] states that, in real-life sets of numerical data, the leading digit is likely to follow a distinct distribution (see the plot in the https://en.wikipedia.org/wiki/Benford%27s_law[provided link]). By this logic, the leading digits of the dollar amounts of the orders should roughly follow this distribution, right?
-
-Use any available `bash` tools you'd like to get a good idea of the count or percentage of the sales (in dollars) by starting digit. Are the results expected? Could there be some "funny business" going on? Write 1-2 sentences explaining what you think.
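-
-[TIP]
-====
-One possible sketch (the field number and file name here are placeholders): strip the dollar signs, grab the leading digit of each amount, then tally how often each digit appears.
-
-[source,bash]
-----
-# -o prints only the matched text; ^[1-9] grabs the first character when it is a nonzero digit
-cut -d';' -f22 somefile.txt | tr -d '$' | grep -o '^[1-9]' | sort | uniq -c | sort -n -r
-----
-====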
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project05.adoc
deleted file mode 100644
index 36e15ef75..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project05.adoc
+++ /dev/null
@@ -1,169 +0,0 @@
-= TDM 20100: Project 5 -- 2022
-
-**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation quickly using something like `awk`.
-
-**Context:** This is the first of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
-
-**Scope:** awk, UNIX utilities
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-While the UNIX tools we've used up to this point are very useful, `awk` enables many new capabilities, and can even replace major functionality of other tools.
-
-In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using `awk`.
-
-Similarly, we've used `head` to print the header line. Use `awk` to do the same.
-
-Similarly, we've used `wc` to count the number of lines in the dataset. Use `awk` to do the same.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-++++
-
-++++
-
-In a previous question, we used `sort` in combination with `uniq` to find the stores with the greatest number of sales.
-
-Use `awk` to find the 10 stores with the greatest number of sales. In a previous solution, our output was minimal -- we had a count and a store number. This time, take some time to format the output nicely, _and_ use the store number (not the store name) to find the count.
-
-[TIP]
-====
-Sorting an array by values in `awk` can be confusing. Check out https://stackoverflow.com/questions/5342782/sort-associative-array-with-awk[this excellent stackoverflow post] to see a couple of ways to do this. "Edit 2" is the easiest one to follow.
-====
-
-[NOTE]
-====
-You can even use the store number to count the number of sales, and save the most recent store name for each store number as you go, so that you can _print_ the store names alongside the counts in the output.
-====
-
-[TIP]
-====
-You can pipe output to the `column` unix command to get neatly formatted output!
-
-[source,bash]
-----
-man column
-----
-====
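-
-[TIP]
-====
-A generic sketch of counting with an `awk` associative array and handing the sorting off to `sort` (the delimiter and field number are placeholders):
-
-[source,bash]
-----
-# counts[$3]++ tallies how many times each value of field 3 appears;
-# the END block prints "count value" pairs, which sort then orders numerically
-awk -F';' '{counts[$3]++} END {for (k in counts) print counts[k], k}' somefile.txt | sort -n -r | head -n 10
-----
-====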
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Calculate the total sales (in USD). Do this using _only_ `awk`.
-
-[TIP]
-====
-`gsub` is a powerful awk utility that allows you to replace a string with another string. For example, you could replace all `$`'s in field 2 with nothing by:
-
-----
-gsub(/\$/, "", $2)
-----
-====
-
-[NOTE]
-====
-The `gsub` operation happens in-place. In a nutshell, what this means is that the original field, `$2`, is replaced with the result of the `gsub` operation (which removes the dollar signs).
-====
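-
-[TIP]
-====
-A runnable toy example of `gsub` at work (the data is made up): the dollar sign is stripped from field 2 in place, and the cleaned value is then summed.
-
-[source,bash]
-----
-# prints 3.75
-printf 'a;$1.50\nb;$2.25\n' | awk -F';' '{gsub(/\$/, "", $2); total += $2} END {print total}'
-----
-====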
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Calculate the total sales (in USD) _by county_. Do this using _only_ `awk`. Format your output so it looks like the following.
-
-.output
-----
-FRANKLIN: $386729.06
-HARRISON: $401811.83
-Franklin: $2102880.14
-Harrison: $2109578.24
-----
-
-Notice anything odd about the result? Look carefully at the dataset and suggest an alternative method that would clean up the issue.
-
-[TIP]
-====
-You can see the issue in our tiny sample of output.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-`awk` is extremely powerful, and this liquor dataset is pretty interesting! We haven't covered everything about `awk` (and we won't).
-
-Look at the dataset and ask yourself an interesting question about the data. Use `awk` to solve your problem (or, at least, get you closer to answering the question). Explore various stackoverflow questions about `awk` and `awk` guides online. Try to incorporate an `awk` function you haven't used, or an `awk` trick you haven't seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new.
-
-[NOTE]
-====
-You do not need to limit yourself to _just_ use `awk`, but try to do as much using just `awk` as you are able.
-====
-
-.Items to submit
-====
-- A markdown cell containing the question you are trying to answer about the dataset.
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project06.adoc
deleted file mode 100644
index 3fe6b2d2b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project06.adoc
+++ /dev/null
@@ -1,112 +0,0 @@
-= TDM 20100: Project 6 -- 2022
-
-**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation quickly using something like `awk`.
-
-**Context:** This is the second of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
-
-**Scope:** awk, UNIX utilities
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt`
-- `/anvil/projects/tdm/data/donorschoose/Donations.csv`
-- `/anvil/projects/tdm/data/whin/weather.csv`
-
-== Questions
-
-=== Question 1
-
-Use `awk` to determine how many columns and rows are in the following dataset: `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt`.
-
-Make sure the output is formatted as follows.
-
-.output
-----
-rows: 12345
-columns: 12345
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-What are the possible "conditions" of the vehicles being sold: `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt`? Use `awk` to answer this question. How many cars of each condition are in the dataset? Make sure to format the output as follows.
-
-.output
-----
-Condition Number of cars
---------- --------------
-AAA 12345
-bb 99999
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Use `awk` to determine the years (for example, 2020, 2021, etc.) of the donations in the dataset `/anvil/projects/tdm/data/donorschoose/Donations.csv`.
-
-[TIP]
-====
-The https://thomas-cokelaer.info/blog/2011/05/awk-the-substr-command-to-select-a-substring/[`substr`] function in `awk` will be useful.
-====
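-
-[TIP]
-====
-A toy example of `substr` (the date format shown is made up, not necessarily the one in `Donations.csv`): `substr(string, start, length)` pulls out part of a string, here the 4-character year.
-
-[source,bash]
-----
-# prints 2018
-echo '2018-03-09;25.00' | awk -F';' '{print substr($1, 1, 4)}'
-----
-====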
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Use `awk` to determine the total donations (in dollars) by year in the dataset `/anvil/projects/tdm/data/donorschoose/Donations.csv`.
-
-Use `printf` and the unix `column` utility to format the output as follows.
-
-.output
-----
-Year Donations in dollars
-2020 $1234.56
-2021 $9999.99
-----
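-
-[TIP]
-====
-A small sketch of the formatting idea (the numbers are made up): `printf` controls the number format, and piping through `column -t` aligns the columns.
-
-[source,bash]
-----
-# prints a two-column, aligned table of year and dollar amount
-printf '2020;1234.56\n2021;9999.99\n' | awk -F';' '{printf "%s $%.2f\n", $1, $2}' | column -t
-----
-====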
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Use `awk` to determine the average `temperature_high` by month: `/anvil/projects/tdm/data/whin/weather.csv`. Make sure the output is sorted by month (you can use `sort` for that). If you are feeling adventurous, try and use `awk` to output a horizontal bar plot using just ascii and `awk`. This last part is _not_ required, but could be fun.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project07.adoc
deleted file mode 100644
index e309669e0..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project07.adoc
+++ /dev/null
@@ -1,351 +0,0 @@
-= TDM 20100: Project 7 -- 2022
-:page-mathjax: true
-
-**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation quickly using something like `awk`.
-
-**Context:** This is the third of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
-
-**Scope:** awk, UNIX utilities
-
-.Learning Objectives
-****
-- Use awk to process and manipulate textual data.
-- Use piping and redirection within the terminal to pass around data between utilities.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], as well as the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Take a look at the dataset. You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful.
-
-The data will look something like the following:
-
-----
-Store Location
-POINT (-91.716615 41.963516)
-POINT (-91.6537 41.987286)
-POINT (-91.52888 40.962331000000006)
-POINT (-93.596755 41.5464)
-POINT (-91.658105 42.010971)
-POINT (-91.494611 41.807199)
-
-POINT (-91.796988 43.307662)
-POINT (-91.358467 41.280183)
-----
-
-What this means is that you can't just parse out the latitude and longitude coordinates and call it a day -- you need to use `awk` functions like `gsub` and `split` to extract the latitude and longitude coordinates.
-
-Use `awk` to print out the latitude and longitude for each line in the original dataset. Output should resemble the following.
-
-----
-lat;lon
-1.23;4.56
-----
-
-[NOTE]
-====
-Make sure to take care of rows that don't have latitude and longitude coordinates -- just skip them. So if your results look like this, you need to add logic to skip the "empty" rows:
-
-----
-
--91.716615 41.963516
--91.6537 41.987286
--91.52888 40.962331000000006
--93.596755 41.5464
--91.658105 42.010971
--91.494611 41.807199
-
--91.796988 43.307662
--91.358467 41.280183
-----
-
-To do this, just go ahead and wrap your print in an if statement similar to:
-
-[source,awk]
-----
-if (length(coords[1]) > 0) {
- print coords[1]";"coords[2]
-}
-----
-====
-
-[TIP]
-====
-`split` and `gsub` will be useful `awk` functions to use for this question.
-====
-
-[TIP]
-====
-If we have a bunch of data formatted like the following:
-
-----
-POINT (-91.716615 41.963516)
-----
-
-If we first used `split` to split on "(", for example like:
-
-[source,awk]
-----
-split($8, coords, "(");
-----
-
-`coords[2]` would be:
-
-----
--91.716615 41.963516)
-----
-
-Then, you could use `gsub` to remove any ")" characters from `coords[2]` like:
-
-[source,awk]
-----
-gsub(/\)/, "", coords[2]);
-----
-
-`coords[2]` would be:
-
-----
--91.716615 41.963516
-----
-
-At this point I'm sure you can see how to use `awk` to extract and print the rest!
-====
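-
-To see those pieces working together before touching the real dataset, here is a minimal sketch run on a single synthetic line -- the `x` fields are just padding so that the POINT string lands in the 8th `;`-delimited field.
-
-[source,bash]
-----
-echo "x;x;x;x;x;x;x;POINT (-91.716615 41.963516)" | awk -F';' '{
-    split($8, coords, "(");        # coords[2] is now "-91.716615 41.963516)"
-    gsub(/\)/, "", coords[2]);     # strip the trailing ")"
-    split(coords[2], lonlat, " "); # lonlat[1] = longitude, lonlat[2] = latitude
-    if (length(lonlat[1]) > 0) print lonlat[2]";"lonlat[1]
-}'
-# prints: 41.963516;-91.716615
-----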
-
-[IMPORTANT]
-====
-Don't forget any lingering space after the first comma! We don't want that.
-====
-
-[IMPORTANT]
-====
-To verify your `awk` command is correct, pipe the first 10 rows to your `awk` command. The output should be the following.
-
-[source,ipython]
-----
-%%bash
-
-head -n 10 /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F';' '{}'
-----
-
-.output
-----
-41.963516;-91.716615
-41.987286;-91.6537
-40.962331000000006;-91.52888
-41.5464;-93.596755
-42.010971;-91.658105
-41.807199;-91.494611
-43.307662;-91.796988
-41.280183;-91.358467
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Use `awk` to create a new dataset called `sales_by_store.csv`. Include the `lat` and `lon` you figured out how to parse in the previous question. The final columns should be the following.
-
-.columns
-----
-store_name;date;sold_usd;volume_sold;lat;lon
-----
-
-Please exclude all rows that do not have latitude and longitude values. Save volume sold as liters, not gallons.
-
-[TIP]
-====
-You can output the results of the `awk` command to a new file called `sales_by_store.csv` as follows.
-
-[source,ipython]
-----
-%%bash
-
-awk -F';' '{}' /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt > $HOME/sales_by_store.csv
-----
-
-The `>` part is a _redirect_. You are redirecting the output from the `awk` command to a new file called `sales_by_store.csv`. A single `>` will first erase the `sales_by_store.csv` file before adding the results of the `awk` command to the file. If you replace `>` with `>>`, it will _append_ the results instead of replacing them -- so if you were to run the command more than once, the `sales_by_store.csv` file would continue to grow.
-====
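-
-If the difference between `>` and `>>` still feels abstract, the following quick experiment (using a throwaway file in your home directory) makes it concrete.
-
-[source,bash]
-----
-echo "first"  >  $HOME/redirect_demo.txt   # creates (or overwrites) the file
-echo "second" >> $HOME/redirect_demo.txt   # appends to the existing file
-cat $HOME/redirect_demo.txt                # prints "first" then "second"
-rm $HOME/redirect_demo.txt                 # clean up
-----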
-
-[TIP]
-====
-To verify your output, the results from piping the first 10 lines of our dataset to your `awk` command should be the following.
-
-[source,ipython]
-----
-%%bash
-
-head -n 10 /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F';' '{}'
-----
-
-.output
-----
-store_name;date;sold_usd;volume_sold;lat;lon
-CVS PHARMACY #8443 / CEDAR RAPIDS;08/16/2012;5.25;41.963516;-91.716615
-SMOKIN' JOE'S #6 TOBACCO AND LIQUOR;09/10/2014;9;41.987286;-91.6537
-HY-VEE FOOD STORE / MOUNT PLEASANT;04/10/2013;1.5;40.962331000000006;-91.52888
-AFAL FOOD & LIQUOR / DES MOINES;08/30/2012;1.12;41.5464;-93.596755
-HY-VEE FOOD STORE #5 / CEDAR RAPIDS;01/26/2015;3;42.010971;-91.658105
-SAM'S MAINSTREET MARKET / SOLON;07/19/2012;12;41.807199;-91.494611
-DECORAH MART;10/23/2013;9;43.307662;-91.796988
-ECON-O-MART / COLUMBUS JUNCTION;05/02/2012;2.25;41.280183;-91.358467
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Believe it or not, `awk` even supports trigonometric functions like `sin` and `cos`. Write a bash script that, given a pair of latitude values and a pair of longitude values, calculates the distance between the two points.
-
-Okay, so how to get started? To calculate this, we can use https://en.wikipedia.org/wiki/Haversine_formula[the Haversine formula]. The formula is:
-
-$2 r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\,\cos(\phi_2)\,\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$
-
-Where:
-
-- $r$ is the radius of the Earth in kilometers; we can use 6367.4447 kilometers
-- $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
-- $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points
-
-In `awk`, `sin` is `sin`, `cos` is `cos`, and `sqrt` is `sqrt`.
-
-To get the `arcsin` use the following `awk` function:
-
-[source,awk]
-----
-function arcsin(x) { return atan2(x, sqrt(1-x*x)) }
-----
-
-To convert from degrees to radians, use the following `awk` function:
-
-[source,awk]
-----
-function dtor(x) { return x*atan2(0, -1)/180 }
-----
-
-The following is how the script should work (with a real example you can test):
-
-[source,ipython]
-----
-%%bash
-
-./question3.sh 40.39978 -91.387531 40.739238 -95.02756
-----
-
-.Results
-----
-309.57
-----
-
-[TIP]
-====
-To include functions in your `awk` command, do as follows:
-
-[source,bash]
-----
-awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
- lat1 = dtor(lat1);
- print lat1;
- # rest of your code here!
-}'
-----
-====
-
-[TIP]
-====
-We want you to create a bash script called `question3.sh` in your `$HOME` directory. After you have your bash script, we want you to run it in a bash cell to see the output.
-
-The following is some skeleton code that you can use to get started.
-
-[source,bash]
-----
-#!/bin/bash
-
-lat1=$1
-lat2=$3
-lon1=$2
-lon2=$4
-
-awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
- lat1 = dtor(lat1);
- print lat1;
- # rest of your code here!
-}'
-----
-====
-
-[TIP]
-====
-You may need to give your script execute permissions like this.
-
-[source,ipython]
-----
-%%bash
-
-chmod +x $HOME/question3.sh
-----
-====
-
-[TIP]
-====
-Read the https://the-examples-book.com/starter-guides/unix/scripts#shebang[shebang] and https://the-examples-book.com/starter-guides/unix/scripts#arguments[arguments] sections in the book.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Find the latitude and longitude points for two interesting points on a map (it could be anywhere). Make a note of the locations and the latitude and longitude values for each point in a markdown cell.
-
-Use your `question3.sh` script to determine the distance. How close is the distance to the distance you get from an online map app? Pretty close?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project08.adoc
deleted file mode 100644
index d911ecc19..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project08.adoc
+++ /dev/null
@@ -1,213 +0,0 @@
-= TDM 20100: Project 8 -- 2022
-
-**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://cloudflare.com[Cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is well worth your time!
-
-**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.
-
-**Scope:** SQL, sqlite
-
-.Learning Objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], as well as the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
-
-[source,ipython]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells.
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Get started by taking a look at the available tables in the database. What tables are available?
-
-[TIP]
-====
-You'll want to prepend `%%sql` to the top of the cell -- it should be the very first line of the cell (no comments or _anything_ else before it).
-
-[source,ipython]
-----
-%%sql
-
--- Query here
-----
-====
-
-[TIP]
-====
-In sqlite, you can show the tables using the following query:
-
-[source, sql]
-----
-.tables
-----
-
-Unfortunately, sqlite-specific functions can't be run in a Jupyter Lab cell like that. Instead, we need to use a different query.
-
-[source, sql]
-----
-SELECT tbl_name FROM sqlite_master WHERE type='table';
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-++++
-
-++++
-
-It's always a good idea to get a feel for what your table(s) look like. A good way to do this is to get the first 5 rows of data from the table. Write and run 6 queries that return the first 5 rows of data of each table.
-
-To get a better idea of the size of the data, you can use the `count` clause to get the number of rows in each table. Write and run 6 queries that return the number of rows in each table.
-
-[TIP]
-====
-Run each query in a separate cell, and remember to limit the query to return only 5 rows each.
-
-You can use the `limit` clause to limit the number of rows returned.
-====
-
-**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-the-first-5-rows-of-the-employees-table[useful example]
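-
-If you'd like to poke around from a terminal first, the equivalent queries can be run with the `sqlite3` command line tool against one of the tables you found in question (1) (here, `titles`) -- the `%%sql` cell versions behave the same way.
-
-[source,bash]
-----
-sqlite3 /anvil/projects/tdm/data/movies_and_tv/imdb.db "SELECT * FROM titles LIMIT 5;"
-sqlite3 /anvil/projects/tdm/data/movies_and_tv/imdb.db "SELECT COUNT(*) FROM titles;"
-----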
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-This dataset contains movie data from https://imdb.com (an Amazon company). As you can probably guess, it would be difficult to load the data from those tables into a nice, neat dataframe -- it would just take too much memory on most systems!
-
-Okay, let's dig into the `titles` table a little bit. Run the following query.
-
-[source, sql]
-----
-SELECT * FROM titles LIMIT 5;
-----
-
-As you can see, every row has a `title_id` for the associated title of a movie or tv show (or other). What is this `title_id`? Check out the following link:
-
-https://www.imdb.com/title/tt0903747/
-
-At this point, you may suspect that it is the id imdb uses to identify a movie or tv show. Well, let's see if that is true. Query our database to get any matching titles from the `titles` table matching the `title_id` provided in the link above.
-
-[TIP]
-====
-The `WHERE` clause can be used to filter the results of a query.
-====
-
-**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-That is pretty cool! Not only do you understand what the `title_id` means _inside_ the database -- but now you know that you can associate a web page with each `title_id` -- for example, if you run the following query, you will get a `title_id` for a "short" called "Carmencita".
-
-[source, sql]
-----
-SELECT * FROM titles LIMIT 5;
-----
-
-.Output
-----
-title_id, type, ...
-tt0000001, short, ...
-----
-
-If you navigate to https://www.imdb.com/title/tt0000001/, sure enough, you'll see a neatly formatted page with data about the movie!
-
-Okay great. Now, if you take a look at the `episodes` table, you'll see that there are both an `episode_title_id` and `show_title_id` associated with each row.
-
-Let's try and make sense of this the same way we did before. Write a query using the `WHERE` clause to find all rows in the `episodes` table where `episode_title_id` is `tt0903747`. What did you get?
-
-Now, write a query using the `WHERE` clause to find all rows in the `episodes` table where `show_title_id` is `tt0903747`. What did you get?
-
-**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Very interesting! It looks like we didn't get any results when we queried for `episode_title_id` with an id of `tt0903747`, but we did for `show_title_id`. This must mean these ids can represent both a _show_ as well as the _episode_ of a show. By that logic, we should be able to find the _title_ of one of the Breaking Bad episodes, in the same way we found the title of the show itself, right?
-
-Okay, take a look at the results of your second query from question (4). Choose one of the `episode_title_id` values, and query the `titles` table to find the title of that episode.
-
-Finally, in a browser, verify that the title of the episode is correct. To verify this, take the `episode_title_id` and plug it into the following link.
-
-https://www.imdb.com/title//
-
-So, I used `tt1232248` for my query. I would check to make sure it matches this.
-
-https://www.imdb.com/title/tt1232248/
-
-**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project09.adoc
deleted file mode 100644
index 18f16b4f7..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project09.adoc
+++ /dev/null
@@ -1,157 +0,0 @@
-= TDM 20100: Project 9 -- 2022
-
-**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.
-
-**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`.
-
-**Scope:** SQL, sqlite
-
-.Learning Objectives
-****
-- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
-- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
-- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], as well as the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/taxi/taxi_sample.db`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-In previous projects, we used `awk` to parse through and summarize data. While `awk` is extremely convenient and can work well, SQL is even better.
-
-Write a query that will return the `fare_amount`, `surcharge`, `tip_amount`, and `tolls_amount` as a percentage of `total_amount`.
-
-[IMPORTANT]
-====
-Make sure to limit the output to only 100 rows! Use the `LIMIT` clause to do this.
-====
-
-[TIP]
-====
-Use the `sum` aggregate function to calculate the totals, and division to figure out the percentages.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Check out the `payment_type` column. Write a query that counts the number of rows for each `payment_type`. The end result should print something like the following.
-
-.Output sample
-----
-payment_type, count
-CASH, 123
-----
-
-[TIP]
-====
-You can use aliasing to control the output header names.
-====
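-
-As a tiny, database-free illustration of aliasing (run from a terminal against an in-memory database), note how `AS` controls the header that gets printed.
-
-[source,bash]
-----
-sqlite3 -header :memory: "SELECT COUNT(*) AS payment_count FROM (SELECT 'CASH' UNION ALL SELECT 'CREDIT');"
-# prints the header "payment_count" followed by the value 2
-----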
-
-Write a query that sums the `total_amount` for `payment_type` of "CASH". What is the total amount of cash payments?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query.
-
-Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, and explain what your query does and how it helps you understand what is going on.
-
-[IMPORTANT]
-====
-Make sure all queries limit output to only 100 rows.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Write a query that gets the average `total_amount` for each year in the database. Which year has the largest average `total_amount`? Use the `pickup_datetime` column to determine the year.
-
-[TIP]
-====
-Read https://www.sqlite.org/lang_datefunc.html[this] page and look at the strftime function.
-====
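-
-For example, `strftime` can pull the year straight out of a datetime string. Here is a quick, standalone check you can run from a terminal (the date shown is made up).
-
-[source,bash]
-----
-sqlite3 :memory: "SELECT strftime('%Y', '2009-03-15 10:30:00') AS year;"
-# prints: 2009
-----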
-
-[TIP]
-====
-If you want the headers to be more descriptive, you can use aliases.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-What percent of data in our database has information on the _location_ of pickup and dropoff? Examine the data to see if there is a pattern to the rows _with_ and _without_ that information.
-
-[TIP]
-====
-There _is_ a distinct pattern. Pay attention to the date and time of the data.
-====
-
-Confirm your hypothesis with the original data set(s) (in `/anvil/projects/tdm/data/taxi/yellow/*.csv`), using bash. This doesn't have to be anything more thorough than running a simple `head` command with a 1-2 sentence explanation.
-
-[TIP]
-====
-Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the `avg` function on a latitude or longitude column, by _year_ to maybe get a pattern.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project10.adoc
deleted file mode 100644
index 97f821cfd..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project10.adoc
+++ /dev/null
@@ -1,275 +0,0 @@
-= TDM 20100: Project 10 -- 2022
-
-**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like `MIN`, `MAX`, and `AVG` in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
-
-**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values!
-
-**Scope:** SQL, SQL in R
-
-.Learning Objectives
-****
-- Demonstrate the ability to interact with popular database management systems within R.
-- Solve data-driven problems using a combination of SQL and R.
-- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
-- Showcase the ability to filter, alias, and write subqueries.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], as well as the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
-
-[source,ipython]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells.
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Let's say we are interested in the Marvel Cinematic Universe (MCU). We could write the following query to get the titles of all the movies in the MCU (at least, available in our database).
-
-[source, sql]
-----
-SELECT premiered, COUNT(*) FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered;
-----
-
-The result would be a perfectly good-looking table. Now, with that being said, are the headers good-looking? Is it clear what data each column contains? I don't know about you, but `COUNT(*)` as a header is not very clear. xref:programming-languages:SQL:aliasing.adoc[Aliasing] is a great way not only to make the headers look good, but also to reduce the text in a query by giving some intermediate results a shorter name.
-
-Fix the query so that the headers are `year` and `movie count`, respectively.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Okay, let's say we are interested in modifying our query from question (1) to get the _percentage_ of MCU movies released in each year. Essentially, we want the count for each group, divided by the total count of all the movies in the MCU.
-
-We can achieve this using a _subquery_. A subquery is a query nested inside another query -- its result can be used as a value (or as a table) by the outer query.
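-
-Here is a small, self-contained illustration (a toy table with made-up values, created in an in-memory database) of using one query's result inside another -- the subquery in parentheses computes a single value that the outer query can then use on every row.
-
-[source,bash]
-----
-sqlite3 -header :memory: "
-CREATE TABLE movies (premiered INTEGER);
-INSERT INTO movies VALUES (2008), (2008), (2010);
-SELECT premiered, COUNT(*) AS n, (SELECT COUNT(*) FROM movies) AS total
-FROM movies GROUP BY premiered;"
-# prints: 2008|2|3 and 2010|1|3 (plus the header row)
-----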
-
-Write a query that returns the total count of the movies in the MCU, and then use it as a subquery to get the percentage of MCU movies released in each year.
-
-[TIP]
-====
-You do _not_ need to change the query from question (1), rather, you just need to _add_ to the query.
-====
-
-[TIP]
-====
-You can directly divide `COUNT(*)` from the original query by the subquery to get the result!
-====
-
-[WARNING]
-====
-Your initial result may seem _very_ wrong (no fractions at all!) -- this is OK, we will fix it in the next question.
-====
-
-[IMPORTANT]
-====
-Use aliasing to rename the new column to `percentage`.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Okay, if you did question (2) correctly, you should have got a result that looks a lot like:
-
-.Output
-----
-year,movie count,percentage
-2008, 2, 0
-2010, 1, 0
-2011, 2, 0
-...
-----
-
-What is going on?
-
-The `AS` keyword can _also_ be used to _cast_ types. Many programming languages distinguish between an "integer" type -- which is for numeric data _without_ a decimal place -- and a "float" type -- which is for numeric data _with_ a decimal place. In _many_ languages, if you were to do the following, you'd get what _may_ be unexpected output.
-
-[source,c]
-----
-9/4
-----
-
-.Output
-----
-2
-----
-
-Since both of the values are integers, the result will truncate the decimal place. In other words, the result will be 2, instead of 2.25.
-
-In Python, they've made changes so this doesn't happen.
-
-[source,python]
-----
-9/4
-----
-
-.Output
-----
-2.25
-----
-
-However, if we want the "regular" functionality we can use the `//` operator.
-
-[source,python]
-----
-9//4
-----
-
-.Output
-----
-2
-----
-
-Okay, sqlite does this as well.
-
-[source, sql]
-----
-SELECT 9/4 as result;
-----
-
-.Output
-----
-result
-2
-----
-
-_This_ is why we are getting 0's for the percentage column!
-
-How do we fix this? The following is an example.
-
-[source, sql]
-----
-SELECT CAST(9 AS real)/4 as result;
-----
-
-.Output
-----
-result
-2.25
-----
-
-[NOTE]
-====
-Here, "real" represents "float" or "double" -- it is another way of saying a number with a decimal place.
-====
-
-[IMPORTANT]
-====
-When you do arithmetic with an integer and a real/float, the result will be a real/float. This is why our result is a real even though 50% of our values are integers.
-====
-
-Fix the query so that the results look something like:
-
-.Output
-----
-year, movie count, percentage
-2008, 2, 0.0689...
-2010, 1, 0.034482...
-2011, 2, 0.0689...
-----
-
-[NOTE]
-====
-You can read more about `sqlite3` types https://www.sqlite.org/datatype3.html[here]. In a lot of ways, the `sqlite3` typing system is simpler than typical RDBMS systems, and in other ways it is more complex. `sqlite3` considers their flexible typing https://www.sqlite.org/flextypegood.html[a feature]. However, `sqlite3` does provide https://www.sqlite.org/stricttables.html[strict tables] for individuals who want a more stringent set of typing rules.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-You now know 2 different applications of the `AS` keyword, and you also know how to use a query as a subquery, great!
-
-In the previous project, we were introduced to aggregate functions. We used the GROUP BY clause to group our results by the `premiered` column in this project too! We know we can use the `WHERE` clause to filter our results, but what if we wanted to filter our results based on an aggregated column?
-
-Modify our query from question (3) to print only the rows where the `movie count` is greater than 2.
-
-[TIP]
-====
-See https://www.geeksforgeeks.org/having-vs-where-clause-in-sql/[this article] for more information on the `HAVING` and `WHERE` clauses.
-====
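-
-The short version: `WHERE` filters individual rows _before_ grouping, while `HAVING` filters whole groups _after_ aggregation. A toy illustration (made-up table and values, in an in-memory database):
-
-[source,bash]
-----
-sqlite3 -header :memory: "
-CREATE TABLE t (grp TEXT);
-INSERT INTO t VALUES ('a'), ('a'), ('a'), ('b');
-SELECT grp, COUNT(*) AS n FROM t GROUP BY grp HAVING COUNT(*) > 2;"
-# only the 'a' group is printed, since it is the only group with more than 2 rows
-----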
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-Write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3.
-
-Look at the results. Which year had the lowest average number of words in the `primary_title` column (no need to write another query for this, just eyeball it)?
-
-[TIP]
-====
-See https://stackoverflow.com/questions/3293790/query-to-count-words-sqlite-3[here]. Replace "@String" with the column you want to count the words in.
-====
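-
-The trick in that post boils down to counting spaces: the number of words is the string length, minus the length with spaces removed, plus one -- so it assumes words are separated by single spaces. A quick standalone check with a made-up title:
-
-[source,bash]
-----
-sqlite3 :memory: "SELECT LENGTH('the dark knight') - LENGTH(REPLACE('the dark knight', ' ', '')) + 1 AS words;"
-# prints: 3
-----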
-
-[TIP]
-====
-If you got it right, there should be 15 rows in the output.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project11.adoc
deleted file mode 100644
index 69c83325e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project11.adoc
+++ /dev/null
@@ -1,149 +0,0 @@
-= TDM 20100: Project 11 -- 2022
-
-**Motivation:** Databases are (usually) composed of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so, we perform "joins"! In this project we will learn about and practice using joins on our imdb database, as it has many tables where the benefit of joins is obvious.
-
-**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in a systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
-
-**Scope:** SQL, sqlite, joins
-
-.Learning Objectives
-****
-- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
-- Showcase the ability to filter, alias, and write subqueries.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], as well as the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
-
-[source,ipython]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells.
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-In the previous project, we provided you with a query to get the number of MCU movies that premiered in each year.
-
-Now that we are learning about _joins_, we have the ability to make much more interesting queries!
-
-Use the provided list of `title_id` values to get a list of the MCU movie `primary_title` values, `premiered` values, and rating (from the provided list of MCU movies).
-
-Which movie had the highest rating? Modify your query to return only the 5 highest and 5 lowest rated movies (again, from the MCU list).
-
-.List of MCU title_ids
-----
-('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286')
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Run the following query.
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286');
-----
-
-Pay close attention to the movies in the output. You will notice there are movies presented in this query that are (likely) not in the query results you got for question (1).
-
-Write a query that returns the `primary_title` of those movies _not_ shown in the result of question (1) but that _are_ shown in the result of the query above. You can use the query in question (1) as a subquery to answer this.
-
-Can you notice a pattern to said movies?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-In the previous questions we explored the difference between an INNER JOIN and a LEFT JOIN. It is likely you used an INNER JOIN (or plain JOIN) in your solution to question (1). As a result, the MCU movies that did not yet have a rating in IMDB are not shown in the output of question (1).
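-
-If the distinction still feels fuzzy, the following toy example (hypothetical tables `t` and `r` in an in-memory database, loosely mirroring titles and their ratings) shows the difference directly: the INNER JOIN drops the unrated movie, while the LEFT JOIN keeps it with an empty rating.
-
-[source,bash]
-----
-sqlite3 -header :memory: "
-CREATE TABLE t (title_id TEXT, primary_title TEXT);
-CREATE TABLE r (title_id TEXT, rating REAL);
-INSERT INTO t VALUES ('tt1', 'Movie A'), ('tt2', 'Movie B');
-INSERT INTO r VALUES ('tt1', 8.5);
-SELECT t.primary_title, r.rating FROM t INNER JOIN r ON t.title_id = r.title_id;
-SELECT t.primary_title, r.rating FROM t LEFT JOIN r ON t.title_id = r.title_id;"
-----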
-
-Modify your query from question (1) so that it returns a list of _all_ MCU movies with their associated rating, regardless of whether or not the movie has a rating.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-In the previous project, question (5) asked you to write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3.
-
-Okay, great. What would be more interesting would be to see the average number of words in the `primary_title` column for titles with a rating of 8.5 or higher. Write a query to do that. How many words on average does a title with 8.5 or higher rating have?
-
-Write another query that does the same for titles with < 8.5 rating. Is the average title length notably different?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-We have a fun database, and you've learned a new trick (joins). Use your newfound knowledge to write a query that uses joins to accomplish a task you couldn't previously (easily) tackle, and answers a question you are interested in.
-
-Explain what your query does, and talk about the results. Explain why you chose either a LEFT join or INNER join.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project12.adoc
deleted file mode 100644
index 20d06752f..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project12.adoc
+++ /dev/null
@@ -1,342 +0,0 @@
-= TDM 20100: Project 12 -- 2022
-
-**Motivation:** In the previous projects, you've gained experience writing all types of queries, touching on the majority of the main concepts. One critical topic that we _haven't_ yet covered is creating your _own_ database. While database administrators and engineers will typically be in charge of large production databases, it is likely that you will need to prop up a small development database for your own use at some point in time (and _many_ of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets.
-
-**Context:** This is the final project for the semester, and we will be walking through the useful skill of creating a database and populating it with data. We will (mostly) be using the https://www.sqlite.org/[sqlite3] command line tool to interact with the database.
-
-**Scope:** sql, sqlite, unix
-
-.Learning Objectives
-****
-- Create a sqlite database schema.
-- Populate the database with data using `INSERT` statements.
-- Populate the database with data using the command line interface (CLI) for sqlite3.
-- Run queries on a database.
-- Create an index to speed up queries.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], as well as the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/2007.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-[WARNING]
-====
-For any questions requiring a screenshot be included in your notebook, follow the method described https://the-examples-book.com/projects/current-projects/templates#including-an-image-in-your-notebook[here] in order to add a screenshot to your notebook.
-====
-
-First things first: create a new Jupyter Notebook called `firstname-lastname-project12.ipynb`. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the `sqlite3` command line tool from the terminal window.
-
-Okay, once that is done, the first step is schema creation. Before anything else, it is important to note: **The goal of this project is to put the data in `/anvil/projects/tdm/data/flights/subset/2007.csv` into a sqlite database we will call `firstname-lastname-project12.db`.**
-
-With that in mind, run the following (in your terminal) to get a sample of the data.
-
-[source,bash]
-----
-head /anvil/projects/tdm/data/flights/subset/2007.csv
-----
-
-You _should_ receive a result like:
-
-.Output
-----
-Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
-2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
-2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
-2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
-2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3
-2007,1,1,1,831,830,957,1000,WN,2278,N480,86,90,74,-3,1,SMF,PDX,479,3,9,0,,0,0,0,0,0,0
-2007,1,1,1,1430,1420,1553,1550,WN,2386,N611SW,83,90,74,3,10,SMF,PDX,479,2,7,0,,0,0,0,0,0,0
-2007,1,1,1,1936,1840,2217,2130,WN,409,N482,101,110,89,47,56,SMF,PHX,647,5,7,0,,0,46,0,0,0,1
-2007,1,1,1,944,935,1223,1225,WN,1131,N749SW,99,110,86,-2,9,SMF,PHX,647,4,9,0,,0,0,0,0,0,0
-2007,1,1,1,1537,1450,1819,1735,WN,1212,N451,102,105,90,44,47,SMF,PHX,647,5,7,0,,0,20,0,0,0,24
-----
-
-An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types -- just more "set in stone" than the very easily changed data.frame.
-
-Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the `sqlite_schema` table will contain the type, name, tbl_name, rootpage, and sql for the database object.
-
-[NOTE]
-====
-For sqlite, the "database object" could refer to a table, index, view, or trigger.
-====
-
-This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is https://www.sqlite.org/schematab.html[here].
-
-For _our_ purposes, when I refer to "schema", what I _really_ mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the _exact_ commands to build the database from scratch _without_ the data itself. For example, take a look at our `imdb.db` database by running the following in your terminal.
-
-[source,bash]
-----
-sqlite3 /anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-This will open the command line interface (CLI) for sqlite3. It will look similar to:
-
-[source,bash]
-----
-sqlite>
-----
-
-Type `.schema` to see the "schema" for the database.
-
-[NOTE]
-====
-Any command you run in the sqlite CLI that starts with a dot (`.`) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing `.help`.
-====
-
-After running `.schema`, you should see a variety of legitimate SQL commands that will create the structure of your database _without_ the data itself. This makes the schema an extremely useful, self-documenting description of the database.
-
-Okay, great. Now, let's study the sample of our `2007.csv` dataset. Create a markdown list of key:value pairs for each column in the dataset. Each _key_ should be the title of the column, and each _value_ should be the _type_ of data that is stored in that column.
-
-For example:
-
-- Year: INTEGER
-
-Where the _value_ is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" https://www.sqlite.org/datatype3.html[here].
-
-Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are _approximate_ numeric data types and _exact_ numeric data types. What you are most familiar with is the _approximate_ numeric data types. In R or Python for example, try running the following:
-
-[source,r]
-----
-(3 - 2.9) <= 0.1
-----
-
-.Output
-----
-FALSE
-----
-
-[source,python]
-----
-(3 - 2.9) <= 0.1
-----
-
-.Output
-----
-False
-----
-
-Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is _critical_ that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for _our_ example, store text as TEXT, numbers _without_ decimal places as INTEGER, and numbers with decimal places as REAL -- our example dataset doesn't have a need for NUMERIC.
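-
-As a tiny illustration of those affinity types in action (a throwaway table with made-up column names, in an in-memory database):
-
-[source,bash]
-----
-sqlite3 -header :memory: "
-CREATE TABLE demo (carrier TEXT, flight_count INTEGER, avg_delay REAL);
-INSERT INTO demo VALUES ('WN', 2891, 7.5);
-SELECT * FROM demo;"
-----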
-
-.Items to submit
-====
-- Screenshot showing the `sqlite3` output when running `.schema` on the `imdb.db` database.
-- A markdown cell containing a list of key value pairs that describe a type for each column in the `2007.csv` dataset.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Okay, great! At this point in time you should have a list of key:value pairs with the column name and the data type, for each column. Now, let's put together our `CREATE TABLE` statement that will create our table in the database.
-
-See https://www.sqlitetutorial.net/sqlite-create-table/[here] for some good examples. Realize that the `CREATE TABLE` statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table `flights`.
-
-Once you've written your `CREATE TABLE` statement, create a new, empty database by running the following in a terminal: `sqlite3 $HOME/flights.db`. Copy and paste the `CREATE TABLE` statement into the sqlite CLI. Upon success, you should see the statement printed when running the dot command `.schema`. Fantastic! You can also verify that the table exists by running the dot command `.tables`.
-
-Congratulations! To finish things off, please paste the `CREATE TABLE` statement into a markdown cell in your notebook. In addition, include a screenshot of your `.schema` output after your `CREATE TABLE` statement was run.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-The next step in the project is to add the data! After all, it _is_ a _data_ base.
-
-To insert data into a table _is_ a bit cumbersome. For example, let's say we wanted to add the following row to our `flights` table.
-
-.Data to add
-----
-Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
-2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
-----
-
-The SQL way would be to run the following query.
-
-[source, sql]
-----
-INSERT INTO flights (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay) VALUES (2007,1,1,1,1232,1225,1341,1340,'WN',2891,'N351',69,75,54,1,7,'SMF','ONT',389,4,11,0,NULL,0,0,0,0,0,0);
-----
-
-NOT ideal -- especially since we have over 7 million rows to add! You could programmatically generate a `.sql` file with the `INSERT INTO` statement, hook the database up with Python or R and insert the data that way, _or_ you could use the wonderful dot commands sqlite already provides.
-
-Insert the data from `2007.csv` into your `flights.db` database. You may find https://stackoverflow.com/questions/13587314/sqlite3-import-csv-exclude-skip-header[this post] very helpful.
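-
-One possible approach (a sketch, not the only way) is to switch the CLI into CSV mode and use the `.import` dot command. Note that the `--skip 1` option, which drops the header row, requires a reasonably recent `sqlite3` (3.32 or newer); on older versions you would need to strip the header some other way, for example with `tail -n +2`.
-
-[source,bash]
-----
-# a sketch -- assumes your sqlite3 is new enough to support ".import --skip"
-sqlite3 $HOME/flights.db <<'EOF'
-.mode csv
-.import --skip 1 /anvil/projects/tdm/data/flights/subset/2007.csv flights
-EOF
-----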
-
-[WARNING]
-====
-You want to make sure you _don't_ include the header line twice! If you included the header line twice, you can verify by running the following in the sqlite CLI.
-
-[source,sql]
-----
-.header on
-SELECT * FROM flights LIMIT 2;
-----
-
-The `.header on` dot command will print the header line for every query you run. If you have double entered the header line, it will appear twice: once because of `.header on`, and a second time because the duplicated header is the first row of data in your table.
-====
-
-Connect to your database in your Jupyter notebook and run a query to get the first 5 rows of your table.
-
-[TIP]
-====
-To connect to your database:
-
-[source,ipython]
-----
-%sql sqlite:///$HOME/flights.db
-----
-====
-
-.Items to submit
-====
-- An `sql` cell in your notebook that connects to your database and runs a query to get the first 5 rows of your table.
-- Output from running the code.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-Woohoo! You've successfully created a database and populated it with data from a dataset -- pretty cool! Connect to your database from inside a terminal.
-
-[source,bash]
-----
-sqlite3 $HOME/flights.db
-----
-
-Now, run the following dot command in order to _time_ our queries: `.timer on`. This will print out the time it takes to run each query. For example, try the following:
-
-[source, sql]
-----
-SELECT * FROM flights LIMIT 5;
-----
-
-Cool! Time the following query.
-
-[source, sql]
-----
-SELECT * FROM flights ORDER BY DepTime LIMIT 1000;
-----
-
-.Output
-----
-Run Time: real 1.824 user 0.836007 sys 0.605384
-----
-
-That is pretty quick, but if (for some odd reason) there were going to be a lot of queries that searched on exact departure times, this could be a big waste of time when done at scale. What can we do to improve this? Add an index!
-
-Run the following query.
-
-[source, sql]
-----
-EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;
-----
-
-The output will indicate that the "plan" is to simply scan the entire table. This has a runtime of O(n), which means the time grows linearly with the number of rows in the table. If 1 million rows take 1 second, then 1 billion rows will take roughly 16 minutes! An _index_ is a data structure that lets us reduce the runtime to O(log(n)). With that scaling, if 1 million rows take 1 second, 1 billion rows would take only about 1.5 seconds. _Much_ more efficient! So what is the catch here? Space.
-
-Leave the sqlite CLI by running `.quit`. Now, see how much space your `flights.db` file is using.
-
-[source,bash]
-----
-ls -la $HOME/flights.db
-----
-
-.Output
-----
-545M
-----
-
-Okay, _after_ I add an index on the `DepTime` column, the file is now `623M` -- while that isn't a _huge_ difference, it would certainly be significant if we scaled up the size of our database. In this case, another drawback would be the insert time. Inserting new data into the database would force the database to _update_ the indexes as well. This can add a _lot_ of time. These are just tradeoffs to consider when you're working with a database.
-
-In this case, we don't care about the extra bit of space -- create an index on the `DepTime` column. https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-indexes-indexes-c4e175f3c346[This article] is a nice easy read that covers this in more detail.
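-
-The statement itself is short -- something along the lines of the following, where `idx_flights_deptime` is just an arbitrary name of your choosing.
-
-[source,bash]
-----
-sqlite3 $HOME/flights.db "CREATE INDEX idx_flights_deptime ON flights(DepTime);"
-----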
-
-Great! Once you've created your index, run the following query.
-
-[IMPORTANT]
-====
-Make sure you turn on the timer first by running `.timer on`!
-====
-
-[source, sql]
-----
-SELECT * FROM flights ORDER BY DepTime LIMIT 1000;
-----
-
-.Output
-----
-Run Time: real 0.095 user 0.009746 sys 0.014301
-----
-
-Wow! That is some _serious_ improvement. What does the "plan" look like?
-
-[source, sql]
-----
-EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232;
-----
-
-You'll notice the "plan" shows it will utilize the index to speed the query up. Great!
-
-Finally, take a glimpse at how much space the database takes up now. Mine took 623M -- an increase of about 14%. Not bad!
-
-.Items to submit
-====
-- Screenshots of your terminal output showing the following:
- - The size of your database before adding the index.
- - The size of your database after adding the index.
- - The time it took to run the query before adding the index.
- - The time it took to run the query after adding the index.
- - The "plan" for the query before adding the index.
- - The "plan" for the query after adding the index.
-====
-
-=== Question 5
-
-++++
-
-++++
-
-We hope that this project has given you a small glimpse into the "other side" of databases. Now, write a query that uses one or more other columns. Time the query, then, create a _new_ index to speed the query up. Time the query _after_ creating the index. Did it work well?
-
-Document the steps of this problem just like you did for question (4).
-
-**Optional challenge:** Try to make your query utilize 2 columns and create an index on both columns to see if you can get a speedup.
-
-.Items to submit
-====
-- Screenshots of your terminal output showing the following:
- - The size of your database before adding the index.
- - The size of your database after adding the index.
- - The time it took to run the query before adding the index.
- - The time it took to run the query after adding the index.
- - The "plan" for the query before adding the index.
- - The "plan" for the query after adding the index.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project13.adoc
deleted file mode 100644
index 3a151b56a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project13.adoc
+++ /dev/null
@@ -1,224 +0,0 @@
-= TDM 20100: Project 13 -- 2022
-
-**Motivation:** We've covered a lot about SQL in a relatively short amount of time, but we still haven't touched on some other important SQL topics. In this final project, we will touch on some other important SQL topics.
-
-**Context:** In the previous project, you had the opportunity to take the time to insert data into a `sqlite3` database. There are still many common tasks that you may need to perform using a database: triggers, views, transactions, and even a few `sqlite3`-specific functionalities that may prove useful.
-
-**Scope:** SQL
-
-.Learning Objectives
-****
-- Create a trigger on your `sqlite3` database and demonstrate that it works.
-- Create one or more views on your `sqlite3` database and demonstrate that they work.
-- Describe and use a database transaction. Rollback a transaction.
-- Optionally, use the `sqlite3` "savepoint", "rollback to", and "release" commands.
-- Optionally, use the `sqlite3` "attach" and "detach" commands to execute queries across multiple databases.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/flights_sample.db`
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-Begin by copying the database from the previous project to your `$HOME` directory. Open up a terminal and run the following.
-
-[source,bash]
-----
-cp /anvil/projects/tdm/data/flights/subset/flights_sample.db $HOME
-----
-
-Go ahead and launch `sqlite3` and connect to the database.
-
-[source,bash]
-----
-sqlite3 $HOME/flights_sample.db
-----
-
-From within `sqlite3`, test things out to make sure the data looks right.
-
-[source, sql]
-----
-.header on
-SELECT * FROM flights LIMIT 5;
-----
-
-.expected output
-----
-Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay
-2007|1|1|1|1232|1225|1341|1340|WN|2891|N351|69|75|54|1|7|SMF|ONT|389|4|11|0||0|0|0|0|0|0
-2007|1|1|1|1918|1905|2043|2035|WN|462|N370|85|90|74|8|13|SMF|PDX|479|5|6|0||0|0|0|0|0|0
-2007|1|1|1|2206|2130|2334|2300|WN|1229|N685|88|90|73|34|36|SMF|PDX|479|6|9|0||0|3|0|0|0|31
-2007|1|1|1|1230|1200|1356|1330|WN|1355|N364|86|90|75|26|30|SMF|PDX|479|3|8|0||0|23|0|0|0|3
-2007|1|1|1|831|830|957|1000|WN|2278|N480|86|90|74|-3|1|SMF|PDX|479|3|9|0||0|0|0|0|0|0
-----
-
-With any luck, things should be working just fine.
-
-Let's go ahead and create a trigger. A trigger is just what it sounds like: when a specific event occurs, the database _does_ a specific action. This is a powerful tool. One of the most common uses of a trigger that you will see in the wild is the "updated_at" field. This is a field that stores a datetime value, and uses a _trigger_ to automatically update to the current date and time anytime a record in the database is updated.
-
-First, we need to create a new column called "updated_at", and set the default value to something. In our case, let's set it to January 1, 1970 at 00:00:00.
-
-[source, sql]
-----
-ALTER TABLE flights ADD COLUMN updated_at DATETIME DEFAULT '1970-01-01 00:00:00';
-----
-
-If you query the table now, you will see all of the values have been properly added, great!
-
-[source, sql]
-----
-SELECT * FROM flights LIMIT 5;
-----
-
-Now add a trigger called "update_updated_at" that will update the "updated_at" column to the current date and time whenever a record is updated. Check out the official documentation https://www.sqlite.org/lang_createtrigger.html[here] for examples of triggers.
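-
-[TIP]
-====
-As a rough sketch of the general shape, using a hypothetical table `mytable` and trigger name `mytrigger` -- yours should target the `flights` table and be named "update_updated_at".
-
-[source, sql]
-----
-CREATE TRIGGER mytrigger AFTER UPDATE ON mytable
-BEGIN
-    UPDATE mytable SET updated_at = CURRENT_TIMESTAMP WHERE rowid = NEW.rowid;
-END;
-----
-====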
-
-Once your trigger has been written, go ahead and test it out by updating the following record.
-
-[source, sql]
-----
-UPDATE flights SET Year = 5555 WHERE Year = 2007 AND Month = 1 AND DayofMonth = 1 AND DayOfWeek = 1 AND DepTime = 1225 AND Origin = 'SMF';
-----
-
-[source, sql]
-----
-SELECT * FROM flights WHERE Year = 5555;
-----
-
-If it worked right, your `updated_at` column should have been updated to the current date and time, cool!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Output from connecting to the database from inside your Jupyter notebook and running the `SELECT * FROM flights WHERE Year = 5555;` query.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Next, we will touch on _views_. A view is essentially a virtual table that is created from some query and given a name. Why would you want to create such a thing? Well, there could be many reasons.
-
-Maybe you have a complex query that you need to run frequently, and it would just be easier to see the final result with a click? Maybe the database has horrible naming conventions and you want to rename things in a view to make it more readable and/or queryable?
-
-After some thought, it may occur to you that we've already run into an instance where a view would have been nice -- in our `imdb.db` database!
-
-Copy the `imdb.db` to your `$SCRATCH` directory, and navigate to your `$SCRATCH` directory.
-
-[source,bash]
-----
-cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH
-cd $SCRATCH
-----
-
-It would be nice to have the `rating` and `votes` from the `ratings` table available directly from the `titles` table, wouldn't it? It has been a bit of a hassle to write a JOIN whenever we've needed to see rating information. In fact, if you think about it, the rating information living in its own table doesn't really make that much sense.
-
-Create a _view_ called `titles_with_ratings` that has all of the information from the `titles` table along with the `rating` and `votes` from the `ratings` table. You can find the official documentation https://www.sqlite.org/lang_createview.html[here].
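-
-[TIP]
-====
-The general shape of a view definition is something like the following. This is only an illustration with made-up table and column names -- your view will join `titles` and `ratings`.
-
-[source, sql]
-----
-CREATE VIEW myview AS
-SELECT a.*, b.some_column
-FROM table_a AS a
-INNER JOIN table_b AS b ON a.id = b.id;
-----
-====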
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Output from connecting to the database from inside your Jupyter notebook and running `SELECT * FROM titles_with_ratings LIMIT 5;` query.
-====
-
-=== Question 3
-
-++++
-
-++++
-
-Read the official `sqlite3` documentation for transactions https://www.sqlite.org/lang_transaction.html[here]. As you will read, you've already been using transactions each time you run a query! What we will focus on is how to use transactions to _rollback_ changes, as this is probably the most useful use case you'll run into.
-
-Connect to our `flights_sample.db` database from question (1), start a _deferred_ transaction, and update a row, similar to what we did before, using the following query.
-
-[source, sql]
-----
-UPDATE flights SET Year = 7777 WHERE Year = 5555;
-----
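-
-[TIP]
-====
-If you're unsure how to start the deferred transaction, the statement (run _before_ the `UPDATE` above) is simply:
-
-[source, sql]
-----
-BEGIN DEFERRED TRANSACTION;
-----
-====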
-
-Now, query the record to see what it looks like.
-
-[source, sql]
-----
-SELECT * FROM flights WHERE Year = 7777;
-----
-
-[NOTE]
-====
-You'll notice our _trigger_ from before is still working, cool!
-====
-
-This is pretty great, until you realize that the year should most definitely _not_ be 7777, but rather 5555. Oh no! Well, at this stage you haven't committed your transaction yet, so you can just _rollback_ the changes and everything will be back to normal. Give it a try (again, following the official documentation).
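-
-[TIP]
-====
-The statement itself is short:
-
-[source, sql]
-----
-ROLLBACK;
-----
-====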
-
-After rolling back, run the following query.
-
-[source, sql]
-----
-SELECT * FROM flights WHERE Year = 7777;
-----
-
-As you can see, nothing appears! Let's try with the correct year.
-
-[source,sql]
-----
-SELECT * FROM flights WHERE Year = 5555;
-----
-
-Nice! Not only was our `Year` field rolled back to its original value from question (1), but our `updated_at` field was too -- excellent! As you can imagine, this is pretty powerful stuff, especially if you are writing to a database and want to make sure things look right before _committing_ the changes.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- A screenshot in your Jupyter notebook showing the series of queries that demonstrated your rollback worked as planned.
-====
-
-=== Question 4
-
-++++
-
-++++
-
-SQL and `sqlite3` are powerful tools, and we've barely scratched the surface. Check out the https://www.sqlite.org/docs.html[official documentation], and demonstrate another feature of `sqlite3` that we haven't yet covered.
-
-Some suggestions, if you aren't interested in browsing the documentation: https://www.sqlite.org/windowfunctions.html#biwinfunc[window functions], https://www.sqlite.org/lang_mathfunc.html[math functions], https://www.sqlite.org/lang_datefunc.html[date and time functions], and https://www.sqlite.org/lang_corefunc.html[core functions] (there are many we didn't use!)
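-
-[TIP]
-====
-As one illustration (assuming the `sqlite3` build you're using is recent enough to support window functions), a window function lets you attach an aggregate to every row without collapsing the rows. The column names here come from the `flights` table.
-
-[source, sql]
-----
-SELECT Origin, DepDelay,
-       AVG(DepDelay) OVER (PARTITION BY Origin) AS avg_dep_delay_for_origin
-FROM flights
-LIMIT 10;
-----
-====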
-
-Please make sure the queries you run are run from an SQL cell in your Jupyter notebook.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5 (optional, 0 pts)
-
-There are two other interesting features of `sqlite3`: https://www.sqlite.org/lang_savepoint.html[savepoints] (kind of a named transaction) and https://www.sqlite.org/lang_attach.html[attach and detach]. Demonstrate one or both of these functionalities and write 1-2 sentences stating whether or not you think they are practical or useful features, and why or why not?
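-
-[TIP]
-====
-A rough sketch of the savepoint workflow (the savepoint name and the `UPDATE` here are made up, purely for illustration):
-
-[source, sql]
-----
-SAVEPOINT my_savepoint;
-UPDATE flights SET Year = 9999 WHERE Year = 5555;
-ROLLBACK TO my_savepoint; -- undo everything back to the savepoint
-RELEASE my_savepoint;     -- then discard the savepoint itself
-----
-====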
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-projects.adoc
deleted file mode 100644
index 64b081219..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-projects.adoc
+++ /dev/null
@@ -1,41 +0,0 @@
-= TDM 20100
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-[%header,format=csv,stripes=even,%autowidth.stretch]
-|===
-include::ROOT:example$20100-2022-projects.csv[]
-|===
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2022/tdm20100[https://piazza.com/purdue/fall2022/tdm20100]
-
-=== Link
-
-https://piazza.com/purdue/fall2022/tdm20100/home[https://piazza.com/purdue/fall2022/tdm20100/home]
-
-== Syllabus
-
-See xref:fall2022/logistics/syllabus.adoc[here].
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project01.adoc
deleted file mode 100644
index 1e48cdfd2..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project01.adoc
+++ /dev/null
@@ -1,255 +0,0 @@
-= TDM 30100: Project 1 -- 2022
-
-**Motivation:** It’s been a long summer! Last year, you got some exposure to command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python with data. Topics will include things like: documentation using tools like Sphinx or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be wrestling with data the entire time.
-
-We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some of what you learned last year, and prepare for the rest of the semester.
-
-**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about a variety of useful and exciting topics.
-
-**Scope:** Jupyter Lab, R, Python, Anvil, markdown
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Anvil.
-- Review.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/1991.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-== Questions
-
-=== Question 1
-
-++++
-
-++++
-
-++++
-
-++++
-
-For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster].
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters".
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer.
-
-[NOTE]
-====
-Last year, we used the https://www.rcac.purdue.edu/compute/brown[Brown computing cluster]. Compare the specs of https://www.rcac.purdue.edu/compute/anvil[Anvil] and https://www.rcac.purdue.edu/compute/brown[Brown] -- which one is more powerful?
-====
-
-.Items to submit
-====
-- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil.
-- A sentence explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-=== Question 2
-
-Like the previous year, we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate to https://ondemand.anvil.rcac.purdue.edu and log in using your ACCESS credentials (and Duo). You will be met with a screen with lots of options. Don't worry, however, the next steps are very straightforward.
-
-[TIP]
-====
-If you did not (yet) set up your 2-factor authentication credentials with Duo, you can go back to Step 9 and set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup
-====
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 MB of memory.
-
-[NOTE]
-====
-It is OK to not understand what that means yet, we will learn more about this in TDM 30100. For the curious, however, if you were to open a terminal session in Anvil and run the following, you would see your job queued up.
-
-[source,bash]
-----
-squeue -u username # replace 'username' with your username
-----
-====
-
-[NOTE]
-====
-If you select 4000 MB of memory instead of 3800 MB, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 MB per CPU core.
-====
-
-We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine.
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2022-s2023::
-The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-f2022-s2023-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you.
-
-[NOTE]
-====
-Soon, we'll have the f2022-s2023-r kernel available and ready to use!
-====
-
-Test it out! Run the following code in a new cell. This code prints the hostname of the machine it runs on, revealing which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node on Anvil that you are running on).
-====
-
-=== Question 3
-
-++++
-
-++++
-
-++++
-
-++++
-
-In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know!
-
-Practice running the following examples.
-
-python::
-[source,python]
-----
-my_list = [1, 2, 3]
-print(f'My list is: {my_list}')
-----
-
-SQL::
-[source, sql]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-[source, ipython]
-----
-%%sql
-
-SELECT * FROM titles LIMIT 5;
-----
-
-[NOTE]
-====
-In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements!
-
-[source,ipython]
-----
-%load_ext sql
-----
-====
-
-bash::
-[source,bash]
-----
-%%bash
-
-awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv
-----
-
-[TIP]
-====
-To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`).
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many cells of each type are there in the default template?
-====
-
-=== Question 5
-
-Make a markdown cell containing a list of every topic and/or tool you wish was taught in The Data Mine -- in order of _most_ interested to _least_ interested.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-++++
-
-++++
-
-Review your Python, R, and bash skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output.
-
-[NOTE]
-====
-Your `bash` solution can be both plotless and without a custom function.
-====
-
-Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine), for each language.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project02.adoc
deleted file mode 100644
index 403edbcd6..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project02.adoc
+++ /dev/null
@@ -1,275 +0,0 @@
-= TDM 30100: Project 2 -- 2022
-
-**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc].
-
-**Context:** This is the first project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
-
-**Scope:** Python, documentation
-
-.Learning Objectives
-****
-- Use Sphinx to document a set of Python code.
-- Use pdoc to document a set of Python code.
-- Write and use code that serializes and deserializes data.
-- Learn the pros and cons of various serialization formats.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/apple/health/watch_dump.xml`
-
-== Questions
-
-In this project we will work with `pdoc` to build some simple documentation, review some Python skills that may be rusty, and learn about serialization and deserialization of data -- a common component of many data science and computer science projects, and a key topic to understand when working with APIs.
-
-For the sake of clarity, this project will have more deliverables than the "standard" `.ipynb` notebook, `.py` file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question.
-
-[WARNING]
-====
-Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3.
-====
-
-=== Question 1
-
-Let's start by navigating to https://ondemand.anvil.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the `.ipynb` file). Jupyter Lab is actually _much_ more useful. You can open terminals on Anvil (the cluster), as well as open an editor for `.R` files, `.py` files, or any other text-based file.
-
-Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2022-s2023" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called `untitled.py`. Rename this file to `firstname-lastname-project02.py` (where `firstname` and `lastname` are your first and last name, respectively).
-
-[TIP]
-====
-Make sure you are in your `$HOME` directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file.
-====
-
-Read the https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings["3.8.2 Modules" section] of Google's Python Style Guide. Each individual `.py` file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why?
-
-[TIP]
-====
-Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the **serialized** data _looks_ like (if you tried to open it in a text editor, for example), or how it is _read_.
-====
-
-Save your module.
-
-**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Now, in Jupyter Lab, open a new notebook using the "f2022-s2023" kernel.
-
-[TIP]
-====
-You can have _both_ the Python file _and_ the notebook open in separate Jupyter Lab tabs for easier navigation.
-====
-
-Fill in a code cell for question 1 with a Python comment.
-
-[source,python]
-----
-# See firstname-lastname-project02.py
-----
-
-For this question, read the xref:programming-languages:python:pdoc.adoc[pdoc section], and run a `bash` command to generate the documentation for your module that you created in the previous question, `firstname-lastname-project02.py`. To do this, look at the example provided in the book. Everywhere in the example in the pdoc section of the book where you see "mymodule.py" replace it with _your_ module's name -- `firstname-lastname-project02.py`.
-
-[CAUTION]
-====
-Use `python3` **not** `python` in your command.
-
-We are expecting you to run the command in a `bash` cell, however, if you decide to run it in a terminal, please make sure to document your command. In addition, you'll need to run the following in order for `pdoc` to be recognized as a module.
-
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module load tdm
-module load python/f2022-s2023
-----
-
-Then you can run your command.
-
-[source,bash]
-----
-python3 -m pdoc other commands here
-----
-====
-
-[TIP]
-====
-Use the `-o` flag to specify the output directory -- I would _suggest_ making it somewhere in your `$HOME` directory to avoid permissions issues.
-
-For example, I used `$HOME/output`.
-====
-
-Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called `firstname-lastname-project02.html`. To view this file in your browser, right click on the file, and select btn:[Open in New Browser Tab]. A new browser tab should open with your freshly made documentation. Pretty cool!
-
-[IMPORTANT]
-====
-Ignore the `index.html` file -- we are looking for the `firstname-lastname-project02.html` file.
-====
-
-[TIP]
-====
-You _may_ have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation.
-====
-
-**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-[NOTE]
-====
-When I refer to "watch data" I just mean the dataset for this project.
-====
-
-Write a function called `get_records_for_date` that accepts an `lxml` etree (of our watch data, via `etree.parse`), and a `datetime.date`, and returns a list of Record Elements for the given date. Raise a `TypeError` if the date is not a `datetime.date`, or if the etree is not an `lxml.etree`.
-
-Use the https://google.github.io/styleguide/pyguide.html#383-functions-and-methods[Google Python Style Guide's "Functions and Methods" section] to write the docstring for this function. Be sure to include type annotations for the parameters and return value.
-
-Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read.
-
-Use the `-d` flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look?
-
-[TIP]
-====
-The following code should help get you started.
-
-[source,python]
-----
-import lxml
-import lxml.etree
-from datetime import datetime, date
-
-def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]:
- # docstring goes here
-
- # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not
-
- # test if `for_date` is a `datetime.date`, and raise TypeError if not
-
- # loop through the records in the watch data using the xpath expression `/HealthData/Record`
- # how to see a record, in case you want to
- print(lxml.etree.tostring(record))
-
- # test if the record's `startDate` is the same as `for_date`, and append to a list if it is
-
- # return the list of records
-
-# how to test this function
-tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')
-chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
-my_records = get_records_for_date(tree, chosen_date)
-my_records
-----
-
-.output
-----
-[<Element Record at 0x...>,
- <Element Record at 0x...>,
- <Element Record at 0x...>,
- <Element Record at 0x...>,
- ...]
-----
-====
-
-[TIP]
-====
-The following is some code that will be helpful to test the types.
-
-[source,python]
-----
-from datetime import datetime, date
-
-isinstance(some_date_object, date) # test if some_date_object is a date
-isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree
-----
-====
-
-[TIP]
-====
-To loop through records, you can use the `xpath` method.
-
-[source,python]
-----
-for record in tree.xpath('/HealthData/Record'):
- # do something with record
-----
-====
-
-**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-This was _hopefully_ a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code.
-
-Finally, investigate the https://pdoc.dev/docs/pdoc.html[official pdoc documentation], and make at least 2 changes/customizations to your module. Some examples are below -- feel free to get creative and do something with pdoc outside of this list of options:
-
-- Modify the module so you do not need to pass the `-d` flag in order to let pdoc know that you are using Google-style docstrings.
-- Change the logo of the documentation to your own logo (or any logo you'd like).
-- Add some math formulas and change the output accordingly.
-- Edit and customize pdoc's jinja2 template (or CSS).
-
-[CAUTION]
-====
-For this project, please submit the following files:
-
-- The `.ipynb` file with:
- - a simple comment for question 1,
- - a `bash` cell for question 2 with code that generates your `pdoc` html documentation,
- - a code cell with your `get_records_for_date` function (for question 3)
- - a code cell with the results of running
- +
-[source,python]
-----
-# read in the watch data
-tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')
-
-chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
-my_records = get_records_for_date(tree, chosen_date)
-my_records
-----
- - a `bash` code cell with the code that generates your `pdoc` html documentation (using the google styles)
- - a markdown cell describing the changes you made for question 4.
-- An `.html` file with your newest set of documentation (including your question 4 modifications)
-====
-
-**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments]
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project03.adoc
deleted file mode 100644
index 26634e5aa..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project03.adoc
+++ /dev/null
@@ -1,459 +0,0 @@
-= TDM 30100: Project 3 -- 2022
-
-**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc].
-
-**Context:** This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
-
-**Scope:** Python, documentation
-
-.Learning Objectives
-****
-- Use Sphinx to document a set of Python code.
-- Use pdoc to document a set of Python code.
-- Write and use code that serializes and deserializes data.
-- Learn the pros and cons of various serialization formats.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/apple/health/watch_dump.xml`
-
-== Questions
-
-[WARNING]
-====
-Please use Firefox for this project. We can't guarantee good results if you do not.
-
-Before you begin, open Firefox, and where you would normally put a URL, type the following, followed by enter/return.
-
-```
-about:config
-```
-
-Search for "network.cookie.sameSite.laxByDefault", and change the value to `false`, and close the tab.
-====
-
-=== Question 1
-
-. Create a new directory in your `$HOME` directory called `project03`: `$HOME/project03`
-. Create a new Jupyter notebook in that folder called project03.ipynb, based on the normal project template: `$HOME/project03/project03.ipynb`
-+
-[IMPORTANT]
-====
-The majority of this notebook will just contain a single bash cell with the commands used to re-generate the documentation. This is okay, and by design. The main deliverable for this project will end up being some output from the documentation generator -- this will be explicitly specified as we go along and at the end of the project.
-====
-. Create a module called `firstname_lastname_project03.py` in your `$HOME/project03` directory, with the following contents.
-+
-[source,python]
-----
-"""This module is for project 3 for TDM 30100.
-
-**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case.
-
-**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form.
-
-The following are some common serialization formats:
-
-- JSON
-- Bincode
-- MessagePack
-- YAML
-- TOML
-- Pickle
-- BSON
-- CBOR
-- Parquet
-- XML
-- Protobuf
-
-**JSON:** One of the more wide-spread serialization formats, JSON has the advantages that it is human readable, and has an excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory.
-
-**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and its serialized form can't be read or edited by hand.
-
-Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor.
-"""
-
-
-import lxml
-import lxml.etree
-from datetime import datetime, date
-
-
-def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list:
- """
- Given an `lxml.etree` object and a `datetime.date` object, return a list of records
- with the startDate equal to `for_date`.
- Args:
- tree (lxml.etree): The watch_dump.xml file as an `lxml.etree` object.
- for_date (datetime.date): The date for which returned records should have a startDate equal to.
- Raises:
- TypeError: If `tree` is not an `lxml.etree` object.
- TypeError: If `for_date` is not a `datetime.date` object.
- Returns:
- list: A list of records with the startDate equal to `for_date`.
- """
-
- if not isinstance(tree, lxml.etree._ElementTree):
- raise TypeError('tree must be an lxml.etree')
-
- if not isinstance(for_date, date):
- raise TypeError('for_date must be a datetime.date')
-
- results = []
- for record in tree.xpath('/HealthData/Record'):
- if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date():
- results.append(record)
-
- return results
-----
-+
-[IMPORTANT]
-====
-Make sure you change "firstname" and "lastname" to _your_ first and last name.
-====
-+
-. In a `bash` cell in your `project03.ipynb` notebook, run the following.
-+
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project03
-python3 -m sphinx.cmd.quickstart ./docs -q -p project03 -a "Firstname Lastname" -v 1.0.0 --sep
-----
-+
-[IMPORTANT]
-====
-Please replace "Firstname" and "Lastname" with your own name.
-====
-+
-[NOTE]
-====
-What do all of these arguments do? Check out https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html[this page of the official documentation].
-====
-
-You should be left with a newly created `docs` directory within your `project03` directory: `$HOME/project03/docs`. The directory structure should look similar to the following.
-
-.contents
-----
-project03<1>
-├── 39000_f2021_project03_solutions.ipynb<2>
-├── docs<3>
-│ ├── build <4>
-│ ├── make.bat
-│ ├── Makefile <5>
-│ └── source <6>
-│ ├── conf.py <7>
-│ ├── index.rst <8>
-│ ├── _static
-│ └── _templates
-└── kevin_amstutz_project03.py<9>
-
-5 directories, 6 files
-----
-
-<1> Our module (named `project03`) folder
-<2> Your project notebook (probably named something like `firstname_lastname_project03.ipynb`)
-<3> Your documentation folder
-<4> Your empty build folder where generated documentation will be stored
-<5> The Makefile used to run the commands that generate your documentation
-<6> Your source folder. This folder contains all hand-typed documentation
-<7> Your conf.py file. This file contains the configuration for your documentation.
-<8> Your index.rst file. This file (and all files ending in `.rst`) is written in https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[reStructuredText] -- a Markdown-like syntax.
-<9> Your module. This is the module containing the code from the previous project, with nice, clean docstrings.
-
-Please make the following modifications:
-
-. To Makefile:
-+
-[source,bash]
-----
-# replace
-SPHINXOPTS ?=
-SPHINXBUILD ?= sphinx-build
-SOURCEDIR = source
-BUILDDIR = build
-
-# with the following
-SPHINXOPTS ?=
-SPHINXBUILD ?= python3 -m sphinx.cmd.build
-SOURCEDIR = source
-BUILDDIR = build
-----
-+
-. To conf.py:
-+
-[source,python]
-----
-# CHANGE THE FOLLOWING CONTENT FROM:
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-# import os
-# import sys
-# sys.path.insert(0, os.path.abspath('.'))
-
-# TO:
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-import os
-import sys
-sys.path.insert(0, os.path.abspath('../..'))
-----
-
-Finally, with the modifications above having been made, run the following command in a `bash` cell in Jupyter notebook to generate your documentation.
-
-[source,bash]
-----
-cd $HOME/project03/docs
-make html
-----
-
-After complete, your module folders structure should look something like the following.
-
-.structure
-----
-project03
-├── 39000_f2021_project03_solutions.ipynb
-├── docs
-│ ├── build
-│ │ ├── doctrees
-│ │ │ ├── environment.pickle
-│ │ │ └── index.doctree
-│ │ └── html
-│ │ ├── genindex.html
-│ │ ├── index.html
-│ │ ├── objects.inv
-│ │ ├── search.html
-│ │ ├── searchindex.js
-│ │ ├── _sources
-│ │ │ └── index.rst.txt
-│ │ └── _static
-│ │ ├── alabaster.css
-│ │ ├── basic.css
-│ │ ├── custom.css
-│ │ ├── doctools.js
-│ │ ├── documentation_options.js
-│ │ ├── file.png
-│ │ ├── jquery-3.5.1.js
-│ │ ├── jquery.js
-│ │ ├── language_data.js
-│ │ ├── minus.png
-│ │ ├── plus.png
-│ │ ├── pygments.css
-│ │ ├── searchtools.js
-│ │ ├── underscore-1.13.1.js
-│ │ └── underscore.js
-│ ├── make.bat
-│ ├── Makefile
-│ └── source
-│ ├── conf.py
-│ ├── index.rst
-│ ├── _static
-│ └── _templates
-└── kevin_amstutz_project03.py
-
-9 directories, 29 files
-----
-
-Finally, let's take a look at the results! In the left-hand pane in the Jupyter Lab interface, navigate to `$HOME/project03/docs/build/html/`, and right click on the `index.html` file and choose btn:[Open in New Browser Tab]. You should now be able to see your documentation in a new tab. It should look something like the following.
-
-image::figure34.webp[Resulting Sphinx output, width=792, height=500, loading=lazy, title="Resulting Sphinx output"]
-
-[IMPORTANT]
-====
-Make sure you are able to generate the documentation before you proceed, otherwise, you will not be able to continue to modify, regenerate, and view your documentation.
-====
-
-.Items to submit
-====
-- Code used to solve this problem (in 2 Jupyter `bash` cells).
-====
-
-=== Question 2
-
-One of the most important documents in any package or project is the `README.md` file. This file is so important that version control companies like GitHub and GitLab will automatically display it below the repository's contents. This file contains things like instructions on how to install the package, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like `numpy`, `pytorch`, or any other repository you've come across that you believe does a good job explaining the project.
-
-In the `docs/source` folder, create a new file called `README.rst`. Choose 3-5 of the following "types" of reStructuredText from https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[this webpage], and create a fake README. The content can be https://www.lipsum.com/[Lorem Ipsum]-style content as long as it demonstrates 3-5 of the types of reStructuredText.
-
-- Inline markup
-- Lists and quote-like blocks
-- Literal blocks
-- Doctest blocks
-- Tables
-- Hyperlinks
-- Sections
-- Field lists
-- Roles
-- Images
-- Footnotes
-- Citations
-- Etc.
-
-[IMPORTANT]
-====
-Make sure to include at least 1 https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections[section]. This counts as 1 of your 3-5.
-====
-
-Once complete, add a reference to your README to the `index.rst` file. To add a reference to your `README.rst` file, open the `index.rst` file in an editor and add "README" as follows.
-
-.index.rst
-[source,rst]
-----
-.. project3 documentation master file, created by
- sphinx-quickstart on Wed Sep 1 09:38:12 2021.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-Welcome to project3's documentation!
-====================================
-
-.. toctree::
- :maxdepth: 2
- :caption: Contents:
-
- README
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
-----
-
-[IMPORTANT]
-====
-Make sure "README" is aligned with ":caption:" -- it should be 3 spaces from the left before the "R" in "README".
-====
-
-In a new `bash` cell in your notebook, regenerate your documentation.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project03/docs
-make html
-----
-
-Check out the resulting `index.html` page, and click on the links. Pretty great!
-
-[TIP]
-====
-Things should look similar to the following images.
-
-image::figure35.webp[Sphinx output, width=792, height=500, loading=lazy, title="Sphinx output"]
-
-image::figure36.webp[Sphinx output, width=792, height=500, loading=lazy, title="Sphinx output"]
-====
-
-.Items to submit
-====
-- Screenshot labeled "question02_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly].
-- OR a PDF created by exporting the webpage.
-====
-
-=== Question 3
-
-The `pdoc` package was specifically designed to generate documentation for Python modules using the docstrings _in_ the module. As you may have noticed, this is not "native" to Sphinx.
-
-Sphinx has https://www.sphinx-doc.org/en/master/usage/extensions/index.html[extensions]. One such extension is the https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html[autodoc] extension. This extension provides the same sort of functionality that `pdoc` provides natively.
-
-To use this extension, modify the `conf.py` file in the `docs/source` folder.
-
-[source,python]
-----
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
- 'sphinx.ext.autodoc'
-]
-----
-
-Next, update your `index.rst` file so autodoc knows which modules to extract data from.
-
-[source,rst]
-----
-.. project3 documentation master file, created by
- sphinx-quickstart on Wed Sep 1 09:38:12 2021.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-Welcome to project3's documentation!
-====================================
-
-.. automodule:: firstname_lastname_project03
- :members:
-
-.. toctree::
- :maxdepth: 2
- :caption: Contents:
-
- README
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
-----
-
-In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Not too bad!
-
-.Items to submit
-====
-- Screenshot labeled "question03_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly].
-- OR a PDF created by exporting the webpage.
-====
-
-=== Question 4
-
-Okay, while the documentation looks pretty good, clearly, Sphinx does _not_ recognize Google style docstrings. As you may have guessed, there is an extension for that.
-
-Add the `napoleon` extension to your `conf.py` file.
-
-[source,python]
-----
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
- 'sphinx.ext.autodoc',
- 'sphinx.ext.napoleon'
-]
-----
-
-In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Much better!
-
-.Items to submit
-====
-- Screenshot labeled "question04_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly].
-- OR a PDF created by exporting the webpage.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project04.adoc
deleted file mode 100644
index 051d464e4..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project04.adoc
+++ /dev/null
@@ -1,205 +0,0 @@
-= TDM 30100: Project 4 -- 2022
-
-**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc].
-
-**Context:** This is the third project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
-
-**Scope:** Python, documentation
-
-.Learning Objectives
-****
-- Use Sphinx to document a set of Python code.
-- Use pdoc to document a set of Python code.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-== Questions
-
-=== Question 1
-
-The listed datasets are fairly large, and interesting! They are `json` formatted data. Each _row_ of a single `json` file can be individually read in and processed. Take a look at a single row.
-
-[source,ipython]
-----
-%%bash
-
-head -n 1 /anvil/projects/tdm/data/goodreads/goodreads_books.json
-----
-
-This is nice, because you can individually process a single row. Anytime you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how you can read in a single line and process it.
-
-[source,python]
-----
-import json
-
-with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
- for line in f:
- print(line)
- parsed = json.loads(line)
- print(f"{parsed['isbn']=}")
- print(f"{parsed['num_pages']=}")
- break
-----
-
-In this project, the overall goal will be to implement functions that perform certain operations, write the best docstrings you can, and use your choice of `pdoc` or `sphinx` to generate a pretty set of documentation.
-
-Begin this project by choosing a tool, `pdoc` or `sphinx`, and setting up a `firstname-lastname-project04.py` module that will host your Python functions. In addition, create a Jupyter Notebook that will be used to test out your functions, and generate your documentation. At the end of this project, your deliverable will be your `.ipynb` notebook and either a series of screenshots that captures your documentation, or a PDF created by exporting the resulting webpage of documentation.
-
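-If you choose `pdoc`, generating the documentation will look _something_ like the following. This is only a sketch -- the exact flags depend on the `pdoc` version installed in your environment, so check `pdoc --help` before relying on it.
-
-[source,ipython]
-----
-%%bash
-
-# hypothetical example: write HTML documentation for your module into a "docs" folder
-python3 -m pdoc -o docs firstname-lastname-project04.py
-----
-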
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Write a function called `scrape_image_from_url` that accepts a URL (as a string) and returns a `bytes` object of the data. Make sure `scrape_image_from_url` cleans up after itself and doesn't leave any image files on the filesystem.
-
-. Create a variable with a temporary file name using the `uuid` package.
-. Use the `requests` package to get the response.
-+
-[TIP]
-====
-[source,python]
-----
-import requests
-
-response = requests.get(url, stream=True)
-
-# then the first argument to copyfileobj will be response.raw
-----
-====
-+
-. Open the file and use the `shutil` package's `copyfileobj` method to copy `response.raw` to the file.
-. Open the file and read the contents into a `bytes` object.
-+
-[TIP]
-====
-You can verify a bytes object by:
-
-[source,python]
-----
-type(my_object)
-----
-
-.output
-----
-bytes
-----
-====
-+
-. Use `os.remove` to remove the image file.
-. Return the bytes object.
-
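-Below is a minimal sketch that follows the steps above. It is only one possible structure -- adapt it, test it, and write your own docstring for it.
-
-[source,python]
-----
-import os
-import shutil
-import uuid
-import requests
-
-def scrape_image_from_url(url: str) -> bytes:
-    """Given a URL to an image, return the image as a bytes object."""
-    # temporary file name so nothing permanent is left on the filesystem
-    temp_name = str(uuid.uuid4())
-
-    # stream the response so we can copy the raw bytes to a file
-    response = requests.get(url, stream=True)
-
-    with open(temp_name, "wb") as outfile:
-        shutil.copyfileobj(response.raw, outfile)
-
-    # read the contents back in as a bytes object
-    with open(temp_name, "rb") as infile:
-        contents = infile.read()
-
-    # clean up after ourselves
-    os.remove(temp_name)
-
-    return contents
-----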
-
-You can verify your function works by running the following:
-
-[source,python]
-----
-import shutil
-import requests
-import os
-import uuid
-import hashlib
-
-url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
-my_bytes = scrape_image_from_url(url)
-m = hashlib.sha256()
-m.update(my_bytes)
-m.hexdigest()
-----
-
-.output
-----
-ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Write a function called `json_to_sql` that accepts a single row of the `goodreads_books.json` file (as a string), a table name (as a string), as well as a `set` of values to "skip". This function should then return a string that is a valid `INSERT INTO` SQL statement. See https://www.sqlitetutorial.net/sqlite-insert/[here] for an example of an `INSERT INTO` statement.
-
-The following is a real example you can test out.
-
-[source,python]
-----
-with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
- for line in f:
- first_line = str(line)
- break
-
-first_line
-----
-
-[source,python]
-----
-json_to_sql(first_line, 'books', {'series', 'popular_shelves', 'authors', 'similar_books'})
-----
-
-.output
-----
-"INSERT INTO books (isbn,text_reviews_count,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,description,format,link,publisher,num_pages,publication_day,isbn13,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series) VALUES ('0312853122','1','US','','','false','4.00','','','Paperback','https://www.goodreads.com/book/show/5333265-w-c-fields','St. Martin's Press','256','1','9780312853129','9','','1984','https://www.goodreads.com/book/show/5333265-w-c-fields','https://images.gr-assets.com/books/1310220028m/5333265.jpg','5333265','3','5400751','W.C. Fields: A Life on Film','W.C. Fields: A Life on Film');"
-----
-
-[TIP]
-====
-Here is some (maybe) helpful logic:
-
-. Use `json.loads` to convert the json string to a dict.
-. Remove all key:value pairs from the dict where the key is in the `skip` set.
-. Form a string of comma separated keys.
-. Form a string of comma separated, single-quoted values.
-. Assemble the `INSERT INTO` statement.
-====
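-
-Following that logic, a sketch of one possible implementation is shown below. Treat it as a starting point, not a definitive solution -- the quoting of values is intentionally naive.
-
-[source,python]
-----
-import json
-
-def json_to_sql(row: str, table_name: str, skip: set) -> str:
-    """Convert one json row into an INSERT INTO statement for the given table."""
-    # convert the json string to a dict
-    parsed = json.loads(row)
-
-    # remove all key:value pairs where the key is in the skip set
-    parsed = {k: v for k, v in parsed.items() if k not in skip}
-
-    # comma separated keys, and comma separated single-quoted values
-    keys = ",".join(parsed.keys())
-    values = ",".join(f"'{v}'" for v in parsed.values())
-
-    # assemble the INSERT INTO statement
-    return f"INSERT INTO {table_name} ({keys}) VALUES ({values});"
-----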
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Create a new function, that does something interesting with one or more of these datasets. Just like _all_ the previous functions, make sure to include detailed and clear docstrings.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Generate your final documentation, and assemble and submit your deliverables:
-
-- `.ipynb` file testing out your functions.
-- `firstname-lastname-project04.py` module that includes all of your functions, and associated docstrings.
-- Screenshots and/or a PDF exported from your resulting documentation web page. Basically, something that shows us your resulting documentation.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project05.adoc
deleted file mode 100644
index 3053a6b40..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project05.adoc
+++ /dev/null
@@ -1,149 +0,0 @@
-= TDM 30100: Project 5 -- 2022
-
-**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methodologies like TDD (test-driven development) are popular in some circles and unpopular in others, it is generally agreed that writing good tests is a useful skill and a good habit to have.
-
-**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data.
-
-**Scope:** Python, testing, pytest, mypy, doc tests
-
-.Learning Objectives
-****
-- Write and run unit tests using `pytest`.
-- Include and run doc tests in your docstrings, using `pytest`.
-- Gain familiarity with `mypy`, and explain why static type checking can be useful.
-- Comprehend what a function is, and the components of a function in Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-== Questions
-
-=== Question 1
-
-There are a variety of different testing packages: `doctest`, `unittest`, `nose`, `pytest`, etc. In addition, you can write actual tests, or even include tests in your documentation!
-
-For the sake of simplicity, we will stick to using two packages: `pytest` and `mypy`.
-
-Create a new working directory in your `$HOME` directory.
-
-[source,bash]
-----
-mkdir $HOME/project05
-----
-
-Copy the following, provided Python module to your working directory.
-
-[source,bash]
-----
-cp /anvil/projects/tdm/data/goodreads/goodreads.py $HOME/project05
-----
-
-Look at the module. Use `pytest` to run the doctests in the module.
-
-[TIP]
-====
-See https://docs.pytest.org/en/7.1.x/how-to/doctest.html[here] for instructions on how to run the doctests using `pytest`.
-====
-
-[NOTE]
-====
-One of the tests will fail. This is okay! We will take care of that later.
-====
-
-[NOTE]
-====
-Run the doctests from within a `bash` cell, so the output shows in the Jupyter Notebook.
-====
-
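-If you have not run doctests through `pytest` before, the invocation will look _something_ like the following (the `--doctest-modules` flag tells `pytest` to collect and run the doctests it finds in your modules).
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project05
-python3 -m pytest --doctest-modules
-----
-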
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-One of the doctests failed. Why? Go ahead and fix it so the test passes.
-
-[WARNING]
-====
-This does _not_ mean modify the test itself -- the test is written exactly as intended. Fix the _code_ to handle that scenario.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Add 1 more doctest to `split_json_to_n_parts`, 3 to `get_book_with_isbn`, and 2 more to `get_books_by_author_name`. In a bash cell, re-run your tests, and make sure they all pass.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Doctests are great, but a bit clunky. It is usually better to have 1 or 2 doctests per function that document _how_ to use the function with a concrete example, rather than writing all of your tests as doctests. Think of doctests more along the lines of documenting usage, with a couple of extra tests to run as a bonus.
-
-For example, the first `split_json_to_n_parts` doctest, would be much better suited as a unit test, so it doesn't crowd the readability of the docstring. Create a `test_goodreads.py` module in the same directory as your `goodreads.py` module. Move the first doctest from `split_json_to_n_parts` into a `pytest` unit test.
-
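-If you have not written a `pytest` unit test before, the general shape is simply a file whose name starts with `test_`, containing functions whose names start with `test_`, each making one or more `assert` statements. The following is a generic, self-contained illustration -- it is _not_ tied to `goodreads.py`; your real test should call `split_json_to_n_parts` with the doctest's arguments and assert the doctest's expected output.
-
-[source,python]
-----
-# test_example.py -- illustrative only
-def add(a: int, b: int) -> int:
-    return a + b
-
-def test_add():
-    assert add(2, 3) == 5
-    assert add(-1, 1) == 0
-----
-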
-In a bash cell, run the following in order to make sure the test passes.
-
-[source,ipython]
-----
-%%bash
-
-cd ~/project05
-python3 -m pytest
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Include your `scrape_image_from_url` function from the previous project in your `goodreads.py`. Write at least 1 doctest and at least 1 unit test for this function. Make sure the tests pass. Run the tests from a bash cell so the graders can see the output.
-
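-One possible doctest simply reuses the URL and sha256 hash from the previous project. This is only a sketch -- the hash below is the one given previously, and the doctest needs network access to pass.
-
-[source,python]
-----
-def scrape_image_from_url(url: str) -> bytes:
-    """
-    Given a URL to an image, return the image as a bytes object.
-
-    Example:
-        >>> b = scrape_image_from_url('https://images.gr-assets.com/books/1310220028m/5333265.jpg')
-        >>> import hashlib
-        >>> hashlib.sha256(b).hexdigest()
-        'ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02'
-    """
-----
-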
-[NOTE]
-====
-For this question, it is okay if the doctest and unit test test the same thing. This is all just for practice.
-====
-
-[WARNING]
-====
-Make sure you submit the following files:
-
-- the `.ipynb` notebook with all cells executed and output displayed (including the output of the tests).
-- the `goodreads.py` file containing all of your code.
-- the `test_goodreads.py` file containing all of your unit tests (should be 2 unit tests total).
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project06.adoc
deleted file mode 100644
index 7f3e13ea8..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project06.adoc
+++ /dev/null
@@ -1,78 +0,0 @@
-= TDM 30100: Project 6 -- 2022
-
-**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methodologies like TDD (test-driven development) are popular in some circles and unpopular in others, it is generally agreed that writing good tests is a useful skill and a good habit to have.
-
-**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data.
-
-**Scope:** Python, testing, pytest, mypy, doc tests
-
-.Learning Objectives
-****
-- Write and run unit tests using `pytest`.
-- Include and run doc tests in your docstrings, using `pytest`.
-- Gain familiarity with `mypy`, and explain why static type checking can be useful.
-- Comprehend what a function is, and the components of a function in Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data`
-
-== Questions
-
-[NOTE]
-====
-We will dig in a little deeper in the next project, however, this project is designed to give you a little bit of a rest before October break.
-====
-
-We've provided you with two files for this project:
-
-. `/anvil/projects/tdm/etc/project06.py`
-. `/anvil/projects/tdm/etc/test_project06.py`
-
-Start by copying these files to your own working directory.
-
-[source,ipython]
-----
-%%bash
-
-rm -rf $HOME/project06 || true
-mkdir $HOME/project06
-cp /anvil/projects/tdm/etc/project06.py $HOME/project06
-cp /anvil/projects/tdm/etc/test_project06.py $HOME/project06
-----
-
-The first file, `project06.py` is a module with a bunch of functions. The second file, `test_project06.py` is the set of tests for the `project06.py` module. You can run the tests as follows.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project06
-python3 -m pytest
-----
-
-The goal of this project is to fix all of the code in `project06.py` so that all of the unit tests pass. Do _not_ modify the tests in `test_project06.py`, _only_ modify the code in `project06.py`.
-
-. Fix `find_longest_timegap`.
-. Fix `space_in_dir`.
-. Fix `event_plotter`.
-. Fix `player_info`.
-
-.Items to submit
-====
-- Your modified `project06.py` file.
-- Your `.ipynb` notebook file with a `bash` cell showing 100 percent of your tests passing.
-- Your `.ipynb` notebook file with a markdown cell for each question, and an explanation of what was wrong, and how you fixed it.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project07.adoc
deleted file mode 100644
index 7dc078a50..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project07.adoc
+++ /dev/null
@@ -1,145 +0,0 @@
-= TDM 30100: Project 7 -- 2022
-
-**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methodologies like TDD (test-driven development) are popular in some circles and unpopular in others, it is generally agreed that writing good tests is a useful skill and a good habit to have.
-
-**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data.
-
-**Scope:** Python, testing, pytest, mypy, doc tests
-
-.Learning Objectives
-****
-- Write and run unit tests using `pytest`.
-- Include and run doc tests in your docstrings, using `pytest`.
-- Gain familiarity with `mypy`, and explain why static type checking can be useful.
-- Comprehend what a function is, and the components of a function in Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/*`
-
-== Questions
-
-=== Question 1
-
-In the previous project, you were given a Python module and a test module. You found bugs in the code and fixed the functions to pass the provided tests. This is good practice for fixing code based on tests. In this project, you will get an opportunity to try _test driven development_ (TDD). This is (roughly) where you write tests first, _then_ write your code to pass your tests.
-
-There are some good discussions on TDD https://buttondown.email/hillelwayne/archive/i-have-complicated-feelings-about-tdd-8403/[here] and https://news.ycombinator.com/item?id=32509268[here].
-
-Start by choosing 1 dataset from our data directory: `/anvil/projects/tdm/data`. This will be the dataset which you operate on for the remainder of the project. Alternatively, you may scrape data from online as your "data source".
-
-In a markdown cell, describe 3 functions that you will write, and what those functions should do.
-
-Create two files: `$HOME/project07/project07.py` and `$HOME/project07/test_project07.py`.
-
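-One way to create the working directory and the two (initially empty) files from a terminal is shown below.
-
-[source,bash]
-----
-mkdir -p $HOME/project07
-touch $HOME/project07/project07.py $HOME/project07/test_project07.py
-----
-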
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Expand on question (1). In `$HOME/project07/project07.py` create the 3 functions, and write detailed, https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html[google style] docstrings for each function. Leave each function blank, with just a `pass` keyword. For example.
-
-[source,python]
-----
-def my_super_awesome_function(some_parameter: str) -> str:
- """
- My detailed google style docstring here...
- """
- pass
-----
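-
-For reference, a filled-in google style docstring might look _something_ like the following. The function name and sections here are purely hypothetical -- adapt the `Args`, `Returns`, and `Raises` sections to whatever your functions actually do.
-
-[source,python]
-----
-def count_rows_for_year(path: str, year: int) -> int:
-    """
-    Count the number of rows in the given dataset that occurred in a given year.
-
-    Args:
-        path: Absolute path to the dataset to read.
-        year: Four digit year used to filter the rows.
-
-    Returns:
-        The number of rows in the dataset that occurred in the given year.
-
-    Raises:
-        ValueError: If year is not a four digit integer.
-    """
-    pass
-----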
-
-[WARNING]
-====
-- Make sure the reader can understand what your functions do based on the docstrings.
-- Your functions should not be anything trivial like splitting a string, summing data, or anything that could be easily accomplished in a single line of code using other built-in methods. This is your chance to get creative!
-====
-
-[TIP]
-====
-`pass` is a keyword that performs no operation. Without `pass` (or some other statement) in the function body, you would get an error.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Write _at least_ 2 unit tests (using `pytest`) for _each_ function. Each `assert` counts as a unit test.
-
-Write an additional test for each function (again, using `pytest`). For this set of tests, experiment with features from `pytest` that you haven't tried before! https://docs.pytest.org/en/7.1.x/index.html[This] is the official `pytest` documentation. Some options could be: https://docs.pytest.org/en/7.1.x/how-to/fixtures.html[fixtures], https://docs.pytest.org/en/7.1.x/how-to/parametrize.html[parametrizing], or https://docs.pytest.org/en/7.1.x/how-to/tmp_path.html[using temporary directories and files].
-
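-As a quick illustration of one of those features, a parametrized test lets you run the same assertion over several inputs. The function below is purely hypothetical -- parametrize your _own_ functions instead.
-
-[source,python]
-----
-import pytest
-
-@pytest.mark.parametrize("value,expected", [(2, 4), (3, 9), (4, 16)])
-def test_square(value, expected):
-    # each (value, expected) pair runs as its own test
-    assert value**2 == expected
-----
-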
-In a bash cell, run your `pytest` tests, which should all fail.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project07
-python3 -m pytest
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Begin writing your functions by filling in `$HOME/project07/project07.py`. Record each time you re-run your tests in a new bash cell, by running the following.
-
-[source,ipython]
-----
-%%bash
-
-cd $HOME/project07
-python3 -m pytest
-----
-
-[WARNING]
-====
-Please record each re-run of your test in a **new** bash cell -- you could end up with 10 or more cells where you've run your tests. We want to see the progression as you write the functions and how the failures change as you fix your code. You _don't_ need to record the changes you make to `project07.py`, but we _do_ want to see the results of running the tests each time you run them.
-====
-
-Continue to re-run tests until they all pass and your functions work as intended.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-This is _perhaps_ a very different style of writing code for you, depending on your past experiences! After giving it a try, how do you like TDD? If you haven't yet, read some articles online about TDD. Do you think you should always use TDD? What are your opinions? Write a few sentences about your thoughts, and where you stand on TDD.
-
-[WARNING]
-====
-At the end of this project you should submit the following.
-
-- A `.ipynb` file with your results from running your tests initially in question (3), and repeatedly, until they pass, in question (4).
-- Your `project07.py` file with your passing functions, and beautiful docstrings.
-- Your `test_project07.py` file with at least 9 total (passing) tests, 3 of which should explore previously mentioned "new" features of `pytest`.
-====
-
-.Items to submit
-====
-- A few sentences in a markdown cell describing what you think about TDD.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project08.adoc
deleted file mode 100644
index 646529411..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project08.adoc
+++ /dev/null
@@ -1,414 +0,0 @@
-= TDM 30100: Project 8 -- 2022
-
-**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Python code does _not_ have the advantage of easily being compiled to machine code and shared. In Python, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to pypi.
-
-**Context:** This is the first in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics.
-
-**Scope:** Python, virtual environments, pypi
-
-.Learning Objectives
-****
-- Explain what a virtual environment is and why it is important.
-- Create, update, and use a virtual environment to run somebody else's Python code.
-- Create a Python package and publish it on pypi.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-== Questions
-
-=== Question 1
-
-This project will be focused on creating, updating, and understanding Python virtual environments. Since this is The Data Mine, we will pepper in some small data-related tasks, like writing functions to operate on data, but the focus is on virtual environments.
-
-Let's get started.
-
-Use https://realpython.com/python-virtual-environments-a-primer/#how-can-you-work-with-a-python-virtual-environment[this article] as your reference. First things first. We have a Jupyter notebook that we tend to work in, running our `bash` code in `bash` cells. This is very different from your typical environment. For this reason, let's start by popping open a terminal and working in the terminal.
-
-You can open a terminal in JupyterLab by clicking on the blue "+" button in the upper left-hand corner of the Jupyter interface. Scroll down to the last row and click on the button that says "Terminal".
-
-Start by taking a look at which `python3` you are running. Run the following in the terminal.
-
-[source,bash]
-----
-which python3
-----
-
-Take a look at the available packages as follows.
-
-[source,bash]
-----
-python3 -m pip list
-----
-
-This doesn't look right. It doesn't look like our f2022-s2023 environment, does it? It doesn't even have `pandas` installed. This is because we don't have JupyterLab configured to have our f2022-s2023 version of Python pre-loaded in a fresh terminal session. In fact, with this project, we aren't going to use that environment!
-
-[NOTE]
-====
-The `f2022-s2023` environment runs inside a container. You will learn more about this later on, but suffice it to say it makes it much more difficult to do what we want to do for this project.
-====
-
-Instead, we are going to use the non-containerized version of Python that is running the JupyterLab instance itself! To load up this environment, run the following.
-
-[source,bash]
-----
-module load python/jupyterlab
-----
-
-Then, check out how things have changed.
-
-[source,bash]
-----
-which python3
-----
-
-[source,bash]
-----
-python3 -m pip list
-----
-
-Looks like we are getting there! Let's back up a bit and explain some things.
-
-What does `which python3` do? `which` will print out the absolute path to the command which would be executed. In this case, running `python3` would be the same as executing `/anvil/projects/tdm/apps/python/3.10.5/bin/python3`.
-
-What does the `python3 -m pip` mean? The `-m` stands for https://docs.python.org/3.8/using/cmdline.html#cmdoption-m[module-name]. In a nutshell, this ensures that the correct `pip` -- the `pip` associated with the current `python3` -- is used! This is important because, if you have many versions of Python installed on your system and environment variables aren't correctly set, you could end up using a completely different `pip` associated with a completely different version of Python, which could cause all sorts of errors! To prevent this, it is safer to run `python3 -m pip` instead of just `pip`.
-
-What does `python3 -m pip list` do? The `python3 -m pip` is the same as before. The `list` command is an argument you can pass to `pip` that lists the packages installed in the current environment.
-
-Perform the following operations.
-
-. Use `venv` to create a new virtual environment called `question01` (if you are unsure of the command, see the tip after this list).
-. Confirm that the virtual environment has been created by running the following.
-+
-[source,bash]
-----
-source question01/bin/activate
-----
-+
-. This should _activate_ your virtual environment. You will notice that `python3` now points to an interpreter in your virtual environment directory.
-+
-[source,bash]
-----
-which python3
-----
-+
-.output
-----
-/path/to/question01/bin/python3
-----
-+
-. In addition, you can see the blank slate when it comes to installed Python packages.
-+
-[source,bash]
-----
-python3 -m pip list
-----
-+
-.output
-----
-Package Version
----------- -------
-pip 22.0.4
-setuptools 58.1.0
-WARNING: You are using pip version 22.0.4; however, version 22.2.2 is available.
-You should consider upgrading via the '/home/x-kamstut/question01/bin/python3 -m pip install --upgrade pip' command.
-----
-
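-[TIP]
-====
-If you have not used `venv` before, step (1) can typically be accomplished with something like the following, run from the directory where you want the environment folder to live.
-
-[source,bash]
-----
-python3 -m venv question01
-----
-====
-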
-.Items to submit
-====
-See https://the-examples-book.com/projects/current-projects/templates#including-an-image-in-your-notebook[here] on how to properly include a screenshot in your Jupyter notebook. If you do _not_ properly submit the screenshot, you will likely lose points, so _please_ take a minute to read it.
-
-- Screenshot showing `source question01/bin/activate` output.
-- Screenshot showing `which python3` output _after_ activating the virtual environment.
-- Screenshot showing `python3 -m pip list` output _after_ activating the virtual environment.
-====
-
-=== Question 2
-
-Okay, in question (1) you ran some commands and supposedly created your own virtual environment. You are possibly still confused about what you did or why -- that is okay! Things will _hopefully_ become more clear as you progress.
-
-Read https://realpython.com/python-virtual-environments-a-primer/#why-do-you-need-virtual-environments[this] section of the article provided in question (1). In your own words, explain 2 good reasons why virtual environments are important when using Python. Place these explanations in a markdown cell in your notebook.
-
-[NOTE]
-====
-We are going to create and modify and destroy environments quite a bit! Don't be intimidated by messing around with your environment.
-====
-
-Okay, now that you've grokked why virtual environments are important, let's try to see a virtual environment in action.
-
-Activate your empty virtual environment from question (1) (if it is not already active). If you were to try and import the `requests` package, what do you expect would happen? If you were to deactivate your virtual environment and then try and import the `requests` package, what would you expect would happen?
-
-Test out both! First activate your virtual environment from question (1), and then run `python3` and try to `import requests`. Next, run `deactivate` to deactivate your virtual environment. Run `python3` and try to `import requests`. Were the results what you expected? Please include 2 screenshots -- 1 for each attempt at importing `requests`.
-
-[NOTE]
-====
-As you should _hopefully_ see -- the virtual environments _do_ work! When a certain environment is active, only a certain set of packages is made available! Pretty cool!
-====
-
-.Items to submit
-====
-- 1-2 sentences, _per reason_, on why virtual environments are important when using Python.
-- 1 screenshot showing the attempt to import the `requests` library from within your question01 virtual environment.
-- 1 screenshot showing the attempt to import the `requests` library from outside the question01 virtual environment.
-====
-
-=== Question 3
-
-Create a Python script called `imdb.py` that accepts an `imdb` subcommand followed by a single argument, `id`, and prints out the following.
-
-[source,bash]
-----
-python3 imdb.py imdb tt4236770
-----
-
-.output
-----
-Title: Yellowstone
-Rating: 8.6
-----
-
-You can use the following as your skeleton.
-
-[source,python]
-----
-#!/usr/bin/env python3
-
-import argparse
-import sqlite3
-import sys
-from rich import print
-
-def get_info(iid: str) -> None:
- """
- Given an imdb id, print out some basic info about the title.
- """
-
- conn = sqlite3.connect("/anvil/projects/tdm/data/movies_and_tv/imdb.db")
- cur = conn.cursor()
-
- # make a query (fill in code here)
-
- # print results
- print(f"Title: [bold blue]{title}[/bold blue]\nRating: [bold green]{rating}[/bold green]")
-
-
-def main():
- parser = argparse.ArgumentParser()
- subparsers = parser.add_subparsers(help="possible commands", dest="command")
- some_parser = subparsers.add_parser("imdb", help="")
- some_parser.add_argument("id", help="id to get info about")
-
- if len(sys.argv) == 1:
- parser.print_help()
- sys.exit(1)
-
- args = parser.parse_args()
-
- if args.command == "imdb":
- get_info(args.id)
-
-if __name__ == "__main__":
- main()
-----
-
-Deactivate any environment you may have active.
-
-[source,bash]
-----
-deactivate
-----
-
-Confirm that the proper `python3` is active.
-
-[source,bash]
-----
-which python3
-----
-
-.output
-----
-/anvil/projects/tdm/apps/python/3.10.5/bin/python3
-----
-
-Now test out your script by running the following.
-
-[source,bash]
-----
-python3 imdb.py imdb tt4236770
-----
-
-What happens? Well, the package `rich` should not be installed to our current environment. Easy enough to fix, right? After all, we know how to make our own virtual environments now!
-
-Create a virtual environment called `question03`. This time, when creating your virtual environment, add an additional flag `--copies` to the very end of the command. Activate your virtual environment and confirm that we are using the correct environment.
-
-[source,bash]
-----
-source question03/bin/activate
-which python3
-----
-
-Immediately trying the script again should fail, since we _still_ don't have the `rich` package installed.
-
-[source,bash]
-----
-python3 imdb.py imdb tt4236770
-----
-
-.output
-----
-ModuleNotFoundError: No module named 'rich'
-----
-
-Okay! Use `pip` (using our `python3 -m pip` trick) to install `rich` and try to run the script again!
-
-Not only should the script now work, but, if you take a look at the packages installed in your environment, there should be some new additions.
-
-[source,bash]
-----
-python3 -m pip list
-----
-
-.output
-----
-Package Version
----------- -------
-commonmark 0.9.1
-pip 22.0.4
-Pygments 2.13.0
-rich 12.6.0
-setuptools 58.1.0
-----
-
-That is awesome! You just solved the issue of not being able to run some Python code because a package was not installed for you. You did this by first creating your own custom Python virtual environment, installing the required package to your virtual environment, and then executing the code that wasn't previously working!
-
-.Items to submit
-====
-- Screenshot showing the activation of the `question03` virtual environment, the `pip` install, and successful output of the script.
-- Screenshot showing the resulting set of packages, `python3 -m pip list`, for the `question03` virtual environment.
-====
-
-=== Question 4
-
-Okay, let's take a tiny step back to peek at a few underlying details of our `question01` and `question03` virtual environments.
-
-Specifically, start with the `question01` environment. The entire environment lives within that `question01` directory, doesn't it? Or _does_ it!?
-
-[source,bash]
-----
-ls -la question01/bin
-----
-
-Notice anything about the contents of the `question01` bin directory? They are symbolic links! `python3` actually points to the same interpreter that was active when we created the virtual environment, the `/anvil/projects/tdm/apps/python/3.10.5/bin/python3` interpreter! But wait, how do we have a different set of packages then, if we are using the same Python interpreter? The answer is, your Python interpreter will look in a variety of locations for your packages. By activating your virtual environment, we've changed the set of locations that Python searches -- its `sys.path`.
-
-If you run the following, you will see the list of directories that Python searches for packages, when importing.
-
-[source,python]
-----
-import sys
-
-sys.path
-----
-
-.example output
-----
-['', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10', '/anvil/projects/tdm/apps/python/3.10.5/lib/python310.zip', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/lib-dynload', '/home/x-kamstut/question01/lib/python3.10/site-packages']
-----
-
-`sys.path` is initialized from the `PYTHONPATH` environment variable, plus some additional installation-dependent defaults. If you take a peek in `question01/lib/python3.10/site-packages`, you will see where `rich` is located. So, even if you look at `/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages` and see that `rich` is _not_ installed in that location, Python searches _all_ of those locations for `rich`, and because `rich` _is_ installed in `question01/lib/python3.10/site-packages`, it will be successfully imported!
-
-This begs the question, what if `/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages` has an _older_ version of `rich` installed -- which version will be imported? Well, let's test this out!
-
-If you look at `plotly` in the jupyterlab environment, you will see it is version 5.8.2.
-
-[source,python]
-----
-import plotly
-plotly.__version__
-----
-
-.output
-----
-5.8.2
-----
-
-Activate your `question03` environment and install `plotly==5.10.0`. Re-run the following code.
-
-[source,python]
-----
-import plotly
-plotly.__version__
-----
-
-What is your output? Is that expected?
-
-[WARNING]
-====
-We modified this question Thursday, October 27 due to a mistake by your instructor (Kevin). If you previously did this problem, no worries, you will get credit either way.
-====
-
-[NOTE]
-====
-If you take a look at the interpreters in `question03/bin`, you will notice that they are _not_ symbolic links, but actual copies of the original interpreter! This is what the `--copies` argument did earlier on! In general, you'll likely be fine using `venv` without the `--copies` flag.
-====
-
-.Items to submit
-====
-- Screenshots of your operations performed from start to finish for this question.
-- 1-2 sentences explaining where Python looks for packages.
-====
-
-=== Question 5
-
-Last, but certainly not least, is the important topic of _pinning_ dependencies. This practice will allow someone else to replicate the exact set of packages needed to run your Python application.
-
-By default, `python3 -m pip install numpy` will install the newest compatible version of numpy to your current environment. Sometimes, that version could be too new and create issues with old code. This is why pinning is important.
-
-You can choose to install an exact version of a package by specifying the version. For example, you could install `numpy` version 1.16, even though the newest version is (as of writing) 1.23. Just run `python3 -m pip install numpy==1.16`.
-
-This is great, but is there an easy way to capture a list of _all_ of the packages in your current virtual environment, pinned to their exact versions? Yes! Yes there is! Try it out.
-
-[source,bash]
-----
-python3 -m pip freeze > requirements.txt
-cat requirements.txt
-----
-
-That's pretty cool! That is a specially formatted list containing a pinned set of packages. You could do the reverse as well. Create a new file called `requirements.txt` with the following contents copied and pasted.
-
-.requirements.txt contents
-----
-commonmark==0.9.1
-plotly==5.10.0
-Pygments==2.13.0
-requests==2.2.1
-rich==12.6.0
-tenacity==8.1.0
-thedatamine==0.1.3
-----
-
-You can use the `-r` option of `pip` to install all of those pinned packages to an environment. Test it out! Create another new virtual environment called `question05`, activate the environment, and use the `-r` option and the `requirements.txt` file to install all of the packages, with the exact same versions. Double check that the results are the same, and that the installed packages are identical to the `requirements.txt` file.
-
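-For reference, that workflow will look _something_ like the following (this assumes the `requirements.txt` above was saved in your current working directory).
-
-[source,bash]
-----
-# create and activate a fresh environment
-python3 -m venv question05
-source question05/bin/activate
-
-# install the exact pinned versions from the requirements.txt file
-python3 -m pip install -r requirements.txt
-
-# confirm the installed packages match requirements.txt
-python3 -m pip list
-----
-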
-Great job! Now, with some Python code, and a `requirements.txt` file, you should be able to setup a virtual environment and run your friend or co-workers code! Very cool!
-
-[NOTE]
-====
-Unfortunately, there is more to this mess than meets the eye, and a _lot_ more that can go wrong. But these basics will serve you well and help you solve lots and lots of problems!
-====
-
-.Items to submit
-====
-- Screenshots showing the results of running the bash commands from the start of this question to the end.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project09.adoc
deleted file mode 100644
index d436c0566..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project09.adoc
+++ /dev/null
@@ -1,481 +0,0 @@
-= TDM 30100: Project 9 -- 2022
-
-**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Python code does _not_ have the advantage of easily being compiled to machine code and shared. In Python, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to pypi.
-
-**Context:** This is the second in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics.
-
-**Scope:** Python, virtual environments, pypi
-
-.Learning Objectives
-****
-- Explain what a virtual environment is and why it is important.
-- Create, update, and use a virtual environment to run somebody else's Python code.
-- Create a Python package and publish it on pypi.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data`
-
-== Questions
-
-=== Question 1
-
-In the previous project, the author made a mistake that _may_ have caused some confusion. Many apologies! Therefore, this first question is going to be a review of what was accomplished in the previous project.
-
-Like the previous project, this project will consist primarily of your Jupyter notebook, with screenshots of your terminal after running commands. `bash` cells will _not_ work properly due to the way our environment is set up. For this reason, it is important to first open a terminal from within Jupyter Lab, to run commands, take screenshots, and display the screenshots in your Jupyter Notebook (your `.ipynb` file).
-
-**Activate our "base" image, so we start with Python 3.10 instead of Python 3.6:**
-
-[source,bash]
-----
-module load python/jupyterlab
-
-# which interpreter is active?
-which python3
-
-# list our current set of python packages using pip
-python3 -m pip list
-----
-
-**Create a virtual environment:**
-
-[source,bash]
-----
-# create the virtual environment named p9q1 in your $HOME directory
-python3 -m venv $HOME/p9q1
-
-# check out what files and folders the environment consists of
-ls -la $HOME/p9q1
-
-# which python are we currently using?
-which python3
-
-# which packages?
-python3 -m pip list
-
-# activate our newly created virtual environment
-source $HOME/p9q1/bin/activate
-
-# which python are we currently using?
-which python3
-
-# what packages do we have available?
-python3 -m pip list
-----
-
-**Activate a virtual environment:**
-
-[source,bash]
-----
-source /path/to/my/virtual/environment/bin/activate
-
-# for example, if our virtual environment was called myvenv in the $HOME directory
-source $HOME/myvenv/bin/activate
-----
-
-**Deactivate a virtual environment:**
-
-[source,bash]
-----
-deactivate
-which python3 # will no longer point to interpreter inside the virtual environment folder
-----
-
-**Install a single package to your virtual environment:**
-
-[source,bash]
-----
-# first activate the virtual environment
-source $HOME/p9q1/bin/activate
-
-# install the requests package
-python3 -m pip install requests
-----
-
-**Pin dependencies, or make the current virtual environment rebuildable by others:**
-
-[source,bash]
-----
-# first activate the virtual environment you'd like to pin the dependencies for
-source $HOME/p9q1/bin/activate
-
-# next, create a requirements.txt file with all of the packages, pinned
-python3 -m pip freeze > $HOME/requirements.txt
-----
-
-**Build a fresh virtual environment using someone else's pinned dependencies:**
-
-[source,bash]
-----
-# create the blank environment
-python3 -m venv $HOME/friendsenv
-
-# activate the environment
-source $HOME/friendsenv/bin/activate
-
-# install the _exact_ packages your friend had, using their requirements.txt
-python3 -m pip install -r $HOME/requirements.txt
-
-# verify the packages are installed
-python3 -m pip list
-python3 -m pip freeze
-----
-
-**Delete a virtual environment you don't use anymore:**
-
-[source,bash]
-----
-# IMPORTANT: Ensure you do NOT have a typo
-rm -rf $HOME/p9q1
-----
-
-Run through some of those commands, until it "pretty much" clicks. Take and include at least a couple screenshots -- no need to include everything if you feel comfortable with everything shown above.
-
-[WARNING]
-====
-Make sure to take screenshots showing your input and output from the terminal throughout this project. Your final submission should show all of the steps as you walk through the project.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Wow! When you look at all of that information from question (1), virtual environments aren't really all that much work to use!
-
-Okay, if you _haven't_ already done something similar (some of you may have), I imagine this next statement is going to be pretty exciting. By the end of this project you will create a new virtual environment and `pip install` your very own package, from https://pypi.org/!
-
-Let's start by writing the heart and soul of your package -- a function that, given an imdb id, scrapes and returns the rating.
-
-. Create a new virtual environment to work in called `question02`.
-. Install one or more of the following packages to your environment (you will probably want at least 2 of these to write this function): `requests`, `beautifulsoup4`, `lxml`.
-. Write and test out your function, `get_rating`.
-
-Please include screenshots of the above steps, all the way until the end where a rating should print for an imdb title.
-
-[TIP]
-====
-For example, https://www.imdb.com/title/tt4236770/?ref_=nv_sr_srsg_0 would have an imdb title id of tt4236770. We want the functionality to look like the following.
-
-[source,python]
-----
-get_rating("tt4236770")
-----
-
-.output
-----
-8.7
-----
-====
-
-[TIP]
-====
-You can use the following as a skeleton -- just fill in part of the xpath expression.
-
-[source,python]
-----
-import requests
-import lxml.html
-
-def get_rating(tid: str) -> float:
- """
- Given an imdb title id, return the title's rating.
- """
- resp = requests.get(f"https://www.imdb.com/title/{tid}", stream=True)
- resp.raw.decode_content = True
- tree = lxml.html.parse(resp.raw)
- element = tree.xpath("//div[@data-testid='FILL THIS IN']/span")[0]
- return float(element.text)
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-The next step in this process is to organize your files. Let's make this a simple, barebones setup.
-
-First, let's decide on the package name. Choose a package name starting with `tdm-`. For example, `tdm-drward`. Create a project directory with the same name as your package. For example, mine would be the following.
-
-[source,bash]
-----
-mkdir $HOME/tdm-drward
-----
-
-Great! This will be the name you use to install via `pip`. So in my case, it would be `python3 -m pip install tdm-drward`.
-
-Next, create 3 new files inside the `tdm-drward` (or equivalent) folder.
-
-- `LICENSE`
-- `pyproject.toml`
-- `README.md`
-
-The first file is a simple text file containing the text of your license. You can use https://choosealicense.com/ to choose a license and paste the text of the license in your `LICENSE` file.
-
-The third file is a `README.md` -- a simple markdown file where you will eventually keep the important instructions for your package. For now, go ahead and just leave it blank.
-
-The second file is a critical file that will be used to specify various bits of information about your package. For now, you can leave it blank.
-
-Next, create a new directory _inside_ the `$HOME/tdm-drward` package directory. Name the directory whatever you want. This will be the name that is used when importing your package. For example, I made `$HOME/tdm-drward/imdb`. For my package, I will do something like:
-
-[source,python]
-----
-import imdb
-
-# or
-
-from imdb import get_rating
-----
-
-Finally, copy and paste your `get_rating` function into a new file called `imdb.py`, and drop `imdb.py` into `$HOME/tdm-drward/imdb` (or your equivalent package path). In addition, create another new file called `\\__init__.py` in the same directory. Leave it blank for now.
-
-[TIP]
-====
-Your directory structure should look something like the following.
-
-[source,bash]
-----
-tree $HOME/tdm-drward
-----
-
-.directory structure
-----
-tdm-drward
-├── imdb
-│ ├── imdb.py
-│ └── __init__.py
-├── LICENSE
-├── pyproject.toml
-└── README.md
-
-1 directory, 5 files
-----
-====
-
-Fantastic! Now, let's create a new virtual environment called `p9q3`, activate the environment, and run the following.
-
-[source,bash]
-----
-python3 -m pip install -e $HOME/tdm-drward
-----
-
-This will install the package to your `p9q3` virtual environment so you can test it out and see if it is working as intended! (The `-e` flag installs the package in "editable" mode, meaning changes you make to the source files are picked up without reinstalling.) Let's go ahead and test it to see if it is doing what we want. Run `python3` to launch a Python interpreter for our virtual environment. Run the following Python code from within the interpreter.
-
-[source,python]
-----
-import imdb # works
-print(imdb.__version__) # error
-imdb.get_rating("tt4236770") # error
-imdb.imdb.get_rating("tt4236770") # works
-from imdb import get_rating # error
-get_rating("tt4236770") # error
-----
-
-What happens? Well, it isn't behaving exactly like we want, but we _can_ import things.
-
-[source,python]
-----
-import imdb.imdb
-imdb.imdb.get_rating("tt4236770") # will work
-
-from imdb.imdb import get_rating
-get_rating("tt4236770") # will also work
-----
-
-Here is the critical part: the `\\__init__.py` file. The presence of a `\\__init__.py` file in a directory is what tells Python to treat the directory as a package. If you have a complex or different directory structure, you can add code to `\\__init__.py` that will clean up your imports. When a package is imported, the code in `\\__init__.py` is executed. You can read more about this https://docs.python.org/3/tutorial/modules.html[here].
-
-Go ahead and add code to `\\__init__.py`.
-
-[source,python]
-----
-from .imdb import *
-
-__version__ = "0.0.1"
-
-print("Hi! You must have imported me!")
-----
-
-Re-install the package.
-
-[source,bash]
-----
-python3 -m pip install -e $HOME/tdm-drward
-----
-
-Now, launch a Python interpreter again and try out our original code.
-
-[source,python]
-----
-import imdb # works, prints your message
-print(imdb.__version__) # prints 0.0.1
-imdb.get_rating("tt4236770") # works
-imdb.imdb.get_rating("tt4236770") # still works
-from imdb import get_rating # works
-get_rating("tt4236770") # works
-----
-
-Wow! Okay, this should start to make a bit more sense now. Go ahead and remove the silly print statement in your `\\__init__.py` -- we don't want that anymore!
-
-Finally, let's take a look at the `pyproject.toml` file and fill in some info about our package.
-
-.pyproject.toml
-----
-[build-system]
-requires = ["setuptools>=61.0.0", "wheel"]
-build-backend = "setuptools.build_meta"
-
-[project]
-name = "FILL IN"
-version = "0.0.1"
-description = "FILL IN"
-readme = "README.md"
-authors = [{ name = "FILL IN", email = "FILLIN@purdue.edu" }]
-license = { file = "LICENSE" }
-classifiers = [
- "License :: OSI Approved :: MIT License",
- "Programming Language :: Python",
- "Programming Language :: Python :: 3",
-]
-keywords = ["example", "imdb", "tutorial", "FILL IN"]
-dependencies = [
- "lxml >= 4.9.1",
- "requests >= 2.28.1",
-]
-requires-python = ">=3.10"
-----
-
-Be sure to fill in the "FILL IN" parts with your information! Lastly, make sure to specify any other Python packages that _your_ package depends on in the "dependencies" section. In the provided example, I require the package "lxml" of at least version 4.9.1, as well as the"requests" package with at least version 2.28.1. This makes it so when we `pip install` our package, that these other packages and _their_ dependencies are _also_ installed -- pretty cool!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Okay, to the best of our knowledge our package is ready to go and we want to make it publicly available to `pip install`. The next step in the process is to register an account with https://test.pypi.org, which you can do https://test.pypi.org/account/register/[here]. Take note of your username and password.
-
-Next, confirm your email address. Open up the email you used to register and click on the link that was sent to you.
-
-Finally, it's time to publish your package to the test package repository!
-
-In order to build and publish your package, we need two packages: `build` and `twine`. Let's setup a virtual environment and install those packages so we can use them!
-
-. Deactivate any environment that may already be active by running: `deactivate`.
-. Create a new virtual environment called `p9q4`.
-. Activate your `p9q4` virtual environment.
-. Use `pip` to install `build` and `twine`: `python3 -m pip install build twine`.
-. Build your package.
-+
-[TIP]
-====
-[source,bash]
-----
-python3 -m build $HOME/tdm-drward
-----
-====
-+
-. Check your package.
-+
-[TIP]
-====
-[source,bash]
-----
-python3 -m twine check $HOME/tdm-drward/dist/*
-----
-
-You may get a warning, that is ok.
-====
-+
-. Upload your package.
-+
-[TIP]
-====
-[source,bash]
-----
-python3 -m twine upload -r testpypi $HOME/tdm-drward/dist/*
-----
-
-You will be prompted to enter your username and password. Enter the credentials associated with your newly created account.
-====
-
-Congrats! You can search for your package at https://test.pypi.org. You are ready to publish the real thing!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Okay, register for a Pypi account https://pypi.org/account/register/[here].
-
-Next, verify your account by checking your associated email account and clicking on the provided link.
-
-At this stage, you already built your package using `python3 -m build`, so you are ready to simply upload your package!
-
-. Deactivate any currently active virtual environment by running: `deactivate`.
-. Create a new virtual environment called `p9q5`.
-. Activate your `p9q5` virtual environment.
-. Use `pip` to install `twine`: `python3 -m pip install twine`.
-. Upload your package: `python3 -m twine upload $HOME/tdm-drward/dist/*`
-+
-[TIP]
-====
-You will be prompted to enter your username and password. Enter the credentials associated with your newly created account.
-====
-+
-. Fantastic! Take a look at https://pypi.org and search for your package! Even better, let's test it out!
-. Your `p9q5` virtual environment should still be active, let's pip install your package!
-+
-[source,bash]
-----
-python3 -m pip install tdm-drward
-----
-+
-[TIP]
-====
-Of course, replace `tdm-drward` with your package name!
-====
-+
-. Finally, test it out! Launch a Python interpreter and run the following.
-+
-[source,python]
-----
-import imdb
-imdb.get_rating("tt4236770") # success!
-----
-
-Congratulations! I hope you all feel empowered to create your own packages!
-
-[WARNING]
-====
-Make sure to take screenshots showing your input and output from the terminal throughout this project. Your final submission should show all of the steps as you walk through the project.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project10.adoc
deleted file mode 100644
index 5b2ad07ec..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project10.adoc
+++ /dev/null
@@ -1,394 +0,0 @@
-= TDM 30100: Project 10 -- 2022
-
-**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Python code does _not_ have the advantage of easily being compiled to machine code and shared. In Python, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to pypi.
-
-**Context:** This is the third in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics.
-
-**Scope:** Python, virtual environments, pypi
-
-.Learning Objectives
-****
-- Explain what a virtual environment is and why it is important.
-- Create, update, and use a virtual environment to run somebody else's Python code.
-- Create a Python package and publish it on pypi.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-In the previous project, you had the opportunity to create a single-function package and publish it to pypi.org! While pretty exciting, we did gloss over some good-to-know tidbits of information. In this project, we will update the package from the previous project and cover some of these missing bits of information. Lastly, we will make modifications to the project to prime it for the API we will begin to build in the remaining projects!
-
-For simplicity, we are going to assume your package is called `tdm-kevin`, and lives in your home directory, with the following structure.
-
-[source,bash]
-----
-tree $HOME/tdm-kevin
-----
-
-.directory structure
-----
-tdm-kevin
-├── imdb
-│ ├── imdb.py
-│ └── __init__.py
-├── LICENSE
-├── pyproject.toml
-└── README.md
-
-1 directory, 5 files
-----
-
-The following are the starting contents of the author's `pyproject.toml`.
-
-.pyproject.toml
-----
-[build-system]
-requires = ["setuptools>=61.0.0", "wheel"]
-build-backend = "setuptools.build_meta"
-
-[project]
-name = "tdm-kevin"
-version = "0.0.1"
-description = "Get imdb ratings."
-readme = "README.md"
-authors = [{ name = "Kevin Amstutz", email = "kamstut@purdue.edu" }]
-license = { file = "LICENSE" }
-classifiers = [
- "License :: OSI Approved :: MIT License",
- "Programming Language :: Python",
- "Programming Language :: Python :: 3",
-]
-keywords = ["example", "imdb", "tutorial"]
-dependencies = [
- "lxml >= 4.9.1",
- "requests >= 2.28.1",
-]
-requires-python = ">=3.10"
-----
-
-If you look on Pypi, you will see that these bits of information directly correlate to different parts of https://pypi.org/project/tdm-kevin/0.0.1/[the associated project page]. For example, the `description` field shows up in a grey banner across the middle of the page. The `authors` appear in the meta section, etc.
-
-We want to take our package and go a new direction with it. We want it to end up being an API where we can query information about IMDB. Let's make the following modifications to our `pyproject.toml` file to reflect the new purpose of our package.
-
-. Update the `description` to describe the general idea of what our updated package will do.
-. Update the contents of our `LICENSE` file to be "MIT License (MIT)" -- the text of our license was just too much on the rendered https://pypi.org/project/tdm-kevin/0.0.1/[project page].
-. Our API will use the https://fastapi.tiangolo.com/[FastAPI] package. Check out https://pypi.org/classifiers/[the pypi classifiers] and see if there is an appropriate "FastAPI" classifier. If so, please add it to our `classifiers`.
-. Update the `keywords` to be any set of keywords you think is appropriate. No change is required.
-. Update the `README.md` file to list the package name and a short description of the project. Could be anything, for now.
-
-Now, go ahead and test out our changes by building and publishing our package on https://test.pypi.org.
-
-. Open a terminal from within Jupyter Lab and run: `module load python/jupyterlab`.
-. Create a virtual environment called `p10q01`.
-. Activate the newly created virtual environment.
-. Install `twine` and `build`: `python3 -m pip install twine build`.
-. Build the package: `python3 -m build $HOME/tdm-kevin`.
-. Check the package: `python3 -m twine check $HOME/tdm-kevin/dist/*`.
-. Upload to https://test.pypi.org: `python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/* --verbose`.
-
-What happens in the very last step? Any ideas why this happens? The reason is that you can upload a given version of your package only once! You already uploaded version 0.0.1 in the previous project, so this gives an error! Let's fix this in the following question.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Of course, you _could_ simply change the `version` section of your `pyproject.toml` file, as well as the `__version__` part of your `imdb.py` file; however, it makes more sense to do this programmatically.
-
-Install `bumpver` to your `p10q01` virtual environment.
-
-For this package, we will use https://semver.org/[semantic versioning]. Read the "Summary" section so you can get a quick overview.
-
-. Navigate to your project directory, for example, `cd $HOME/tdm-kevin`.
-. Initiate `bumpver`: `python3 -m bumpver init`.
-+
-This will add a new section to your `pyproject.toml`. Update the values to look similar to the following.
-+
-.pyproject.toml
-----
-[tool.bumpver]
-current_version = "0.0.1"
-version_pattern = "MAJOR.MINOR.PATCH"
-commit_message = "Bump version {old_version} -> {new_version}"
-commit = true
-tag = true
-push = false
-
-[tool.bumpver.file_patterns]
-"pyproject.toml" = [
- 'current_version = "{version}"',
- 'version = "{version}"',
-]
-"imdb/__init__.py" = [
- "{version}",
-]
-----
-+
-. Use `bumpver` to bump the version a patch number: `python3 -m bumpver update --patch`.
-. Check out `pyproject.toml` and `__init__.py` and see how the version was increased -- cool!
-
-Finally, use `twine` to push your updates up to https://test.pypi.org followed by https://pypi.org.
-
-. Remove your old `dist` directory: `rm -rf $HOME/tdm-kevin/dist`.
-. Build your package: `python3 -m build $HOME/tdm-kevin`.
-. Upload to https://test.pypi.org: `python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/*`
-. Check out your package on https://test.pypi.org to make sure it looks good.
-. Once satisfied, use `twine` to upload to https://pypi.org: `python3 -m twine upload $HOME/tdm-kevin/dist/*`.
-. Check the page out at https://pypi.org.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Okay! You now have version 0.0.2 of your package published. Cool beans. Let's add a barebones https://fastapi.tiangolo.com/[FastAPI API] that we will build on in future projects.
-
-In your `tdm-kevin/imdb` directory add the following two files.
-
-.\\__main__.py
-----
-import argparse
-import sys
-import uvicorn
-
-def start_api(port: int):
-    # start the uvicorn server that serves the FastAPI app defined in imdb/api.py
-    uvicorn.run("imdb.api:app", port=port, log_level="info")
-
-def main():
-    # build a small CLI with an "imdb" subcommand that accepts a --port option
-    parser = argparse.ArgumentParser()
-    subparsers = parser.add_subparsers(help="possible commands", dest="command")
-    some_parser = subparsers.add_parser("imdb", help="")
-    some_parser.add_argument("-p", "--port", help="port to run on", type=int)
-
-    # no arguments given: print the usage message and exit
-    if len(sys.argv) == 1:
-        parser.print_help()
-        sys.exit(1)
-
-    args = parser.parse_args()
-
-    if args.command == "imdb":
-        start_api(port=args.port)
-
-if __name__ == "__main__":
-    main()
-----
-
-.api.py
-----
-from fastapi import FastAPI
-from fastapi.templating import Jinja2Templates
-
-
-app = FastAPI()
-templates = Jinja2Templates(directory='templates/')
-
-
-@app.get("/")
-async def root():
-    """
-    Returns a simple message, "Hello World!"
-
-    Returns:
-        dict: The response JSON.
-    """
-    return {"message": "Hello World"}
-----
-
-Next, install the required packages to your `p10q01` virtual environment.
-
-[source,bash]
-----
-module load libffi/3.3
-python3 -m pip install jinja2 lxml fastapi "uvicorn[standard]"
-----
-
-You are now ready to _run_ your API. First, navigate to your project directory.
-
-[source,bash]
-----
-cd $HOME/tdm-kevin
-----
-
-Next, run the API.
-
-[source,bash]
-----
-python3 -m uvicorn imdb.api:app --reload --port 7777
-----
-
-[IMPORTANT]
-====
-If that command fails with an error stating "ERROR: Address already in use", this means that port 7777 is already in use.
-
-To easily find an available port that you can use, simply run the following.
-
-[source,bash]
-----
-find_port
-----
-
-This will print out a port number that is available and ready to use. For example, if I got "50377" as the output, I would run the following.
-
-[source,bash]
-----
-python3 -m uvicorn imdb.api:app --reload --port 50377
-----
-
-And, unless someone started using port 50377 in the time it took to find a port and execute that line, it should work.
-====
-
-Alright, if it is working and running, open a new terminal and test it out!
-
-[source,bash]
-----
-curl http://127.0.0.1:7777
-
-# or if you are using a different port
-curl http://127.0.0.1:50377
-----
-
-Great! Let's kill our API by holding Ctrl on your keyboard and then pressing "c".
-
-Once killed, let's call this a minor upgrade. Use `bumpver` to increase our version by a minor release.
-
-[source,bash]
-----
-cd $HOME/tdm-kevin
-python3 -m bumpver update --minor
-----
-
-Next, let's build and push up our new package version 0.1.0!
-
-[source,bash]
-----
-cd $HOME
-rm -rf $HOME/tdm-kevin/dist
-python3 -m build $HOME/tdm-kevin
-python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/*
-
-# if all looks well at test.pypi.org
-python3 -m twine upload $HOME/tdm-kevin/dist/*
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Create a new virtual environment called `p10q04`, activate the new environment, and install your package. For example, I would run the following.
-
-[source,bash]
-----
-deactivate
-module load python/jupyterlab
-cd $HOME
-python3 -m venv $HOME/p10q04
-source $HOME/p10q04/bin/activate
-python3 -m pip install tdm-kevin
-----
-
-Now, let's try to run our API.
-
-[source,bash]
-----
-python3 -m imdb
-----
-
-Uh oh! You probably got an error that `uvicorn` was not found! We forgot to list those extra packages as dependencies! In addition to all of that, let's make it so we can run a simple command to run our API. One thing at a time.
-
-First, open up your `pyproject.toml` file and update your `dependencies` to include: `fastapi>=0.85.2`, `Jinja2>=3.1.2`, `lxml>=4.9.1`, `uvicorn[standard]`. This should make it so that all of the required packages are installed into your virtual environment upon installing `tdm-kevin` (or your equivalent `tdm-` package).
-
-Next, add the following to your `pyproject.toml`.
-
-----
-[project.scripts]
-run_api = "imdb.__main__:main"
-----
-
-This _should_ make it so after you've installed the package you can simply run something like the following in order to run the API.
-
-[source,bash]
-----
-run_api imdb --port=7777
-----
-
-Let's test it all out!
-
-[source,bash]
-----
-cd $HOME
-deactivate
-source $HOME/p10q01/bin/activate
-rm -rf $HOME/tdm-kevin/dist
-cd $HOME/tdm-kevin
-python3 -m bumpver update --patch
-cd $HOME
-python3 -m build $HOME/tdm-kevin
-python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/*
-
-# if https://test.pypi.org looks good
-python3 -m twine upload $HOME/tdm-kevin/dist/*
-----
-
-Excellent! You've just published version 0.1.1 of your package! Let's see if things worked out.
-
-Deactivate your virtual environment, create a new environment called `p10`, activate the environment, and install your package. For example, I would run the following.
-
-[source,bash]
-----
-deactivate
-module load python/jupyterlab
-python3 -m venv $HOME/p10
-source $HOME/p10/bin/activate
-module load libffi/3.3
-python3 -m pip install tdm-kevin
-----
-
-[WARNING]
-====
-The `module load libffi/3.3` command is critical, otherwise you will likely run into an error installing your package.
-====
-
-Now, go ahead and give things a shot!
-
-[source,bash]
-----
-run_api imdb --port=7777
-----
-
-Very cool! Congratulations! You can use this package as a template for any other packages you may want to write!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-We've covered a _lot_ in a very short amount of time. Which parts of the last 3 projects would you want more instruction on? What lingering questions do you have? Please write at least 1 question that you'd like to have answered about the previous few projects.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project11.adoc
deleted file mode 100644
index 9ca18f6d0..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project11.adoc
+++ /dev/null
@@ -1,294 +0,0 @@
-= TDM 30100: Project 11 -- 2022
-
-**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. Our focus for the remainder of the semester will be on RESTful APIs.
-
-**Context:** We are working on a series of projects that focus on building and using RESTful APIs. We will learn some basics about interacting and using APIs, and even build our own API.
-
-**Scope:** Python, APIs, requests, fastapi
-
-.Learning Objectives
-****
-- Understand and use the HTTP methods with the `requests` library.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-[WARNING]
-====
-We updated this project on Tuesday. This update includes changes to the commands run on a windows machine in question (3). We updated the commands as well as the shell (powershell instead of cmd). As noted in the updated question, please make sure to run powershell as an administrator before running the provided commands.
-====
-
-=== Question 1
-
-[WARNING]
-====
-If at any time you get stuck, please make a Piazza post and we will help you out!
-====
-
-For this project, we will be doing something a little different in order to _try_ to make the API development experience on Anvil more pleasant. In addition, I imagine many of you will enjoy what we are going to set up and will use it for other projects (or maybe even corporate partners' projects).
-
-Typically, when developing an API, you will have a set of code that you will update and modify. To see the results, you will run your API on a certain _port_ (for example 7777), and then interact with the API using a _client_. The most typical client is probably a web browser. So if we had an API running on port 7777, we could interact with it by navigating to `http://localhost:7777` in our browser.
-
-This is not so simple to do on Anvil, or at least not very enjoyable. While there are a variety of ways, the easiest is to use the "Desktop" app on https://ondemand.anvil.rcac.purdue.edu and use the provided editor and browser on the slow and clunky web interface. This is not ideal, and is what we want to avoid.
-
-Don't just take our word for it, try it out. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Desktop" under "Interactive Apps". Choose the following:
-
-- Allocation: "cis220051"
-- Queue: "shared"
-- Wall Time in Hours: 1
-- Cores: 1
-
-Then, click on the "Launch" button. Wait a minute and click on the "Launch Desktop" button when it appears.
-
-Now, let's copy over our example API and run it.
-
-. Click on Applications > Terminal Emulator
-. Run the following commands:
-+
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module load tdm
-module load python/f2022-s2023
-cp -a /anvil/projects/tdm/etc/hithere $HOME
-cd $HOME/hithere
-----
-+
-. Then, find an unused port by running the following:
-+
-[source,bash]
-----
-find_port # 50087
-----
-+
-. In our example the output was 50087. Now run the API using that port (the port _you_ found).
-+
-[source,bash]
-----
-python3 -m uvicorn imdb.api:app --reload --port 50087
-----
-
-Finally, the last step is to open a browser and check out the API.
-
-. Click on Applications > Web Browser
-. First navigate to `localhost:50087`
-. Next navigate to `localhost:50087/hithere/yourname`
-
-From here, your development process would be to modify the Python files, let the API reload with the changes, and interact with the API using the browser. This is all pretty clunky due to the slowness of the desktop-in-browser experience. In the remainder of this project we will setup something more pleasant.
-
-For this question, submit a screenshot of your work environment on https://ondemand.anvil.rcac.purdue.edu using the "Desktop" app. It would be best to include both the browser and terminal in the screenshot.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-The first step in this process is an easy one. Install https://code.visualstudio.com/[VS Code] on your local machine. This is a free, open source, and cross-platform editor. It is very popular and has a lot of great features that make it easy and enjoyable to use.
-
-For this question, submit a screenshot of your local machine with a VS Code window open.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-You may be wondering how we are going to use VS Code on your _local_ machine to develop on Anvil. The answer is we are going to use a tool called `ssh` along with a VSCode extension to make this process seamless.
-
-Read through https://the-examples-book.com/starter-guides/unix/ssh[this] page in order to gain a cursory knowledge of `ssh` and how to create public/private key pairs. Generate a public/private key pair on your local machine and add your public key to Anvil. For convenience, we've highlighted the steps below for both Mac and Windows.
-
-**Mac**
-
-. Open a terminal window on your local machine.
-. Run the following command to generate a public/private key pair:
-+
-[source,bash]
-----
-ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519
-----
-+
-. Press enter twice to _not_ enter a passphrase (for convenience; if you want to follow the other instructions and set a passphrase, feel free).
-. Display the public key contents:
-+
-[source,bash]
-----
-cat ~/.ssh/id_ed25519.pub
-----
-+
-. Highlight the contents of the public key and copy it to your clipboard.
-. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Clusters" > "Anvil Shell Access".
-. Once presented with a terminal, run the following.
-+
-[source,bash]
-----
-mkdir ~/.ssh
-vim ~/.ssh/authorized_keys
-
-# press "i" (for insert) then paste the contents of your public key on a newline
-# then press Esc, and type ":wq" to save and quit
-
-# set the permissions
-chmod 700 ~/.ssh
-chmod 644 ~/.ssh/authorized_keys
-chmod 644 ~/.ssh/known_hosts
-chmod 644 ~/.ssh/config
-chmod 600 ~/.ssh/id_ed25519
-chmod 644 ~/.ssh/id_ed25519.pub
-----
-. Now, confirm that it works by opening a terminal on your local machine and typing the following.
-+
-[source,bash]
-----
-ssh username@anvil.rcac.purdue.edu
-----
-+
-. Be sure to replace "username" with your _Anvil_ username, for example "x-kamstut".
-. Upon success, you should be immediately connected to Anvil _without_ typing a password -- cool!
-
-**Windows**
-
-https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_keymanagement[This] article may be useful.
-
-. Open a powershell by right clicking on the powershell app and choosing "Run as administrator".
-. Run the following command to generate a public/private key pair:
-+
-[source,bash]
-----
-ssh-keygen -a 100 -t ed25519
-----
-+
-. Press enter twice to _not_ enter a passphrase (for convenience; if you want to follow the other instructions and set a passphrase, feel free).
-. We need to make sure the permissions are correct for your `.ssh` directory and the files therein, otherwise `ssh` will not work properly. Run the following commands from a powershell (again, make sure powershell is running as administrator by right clicking and choosing "Run as administrator"):
-+
-[source,bash]
-----
-# from inside a powershell
-# taken from: https://superuser.com/a/1329702
-New-Variable -Name Key -Value "$env:UserProfile\.ssh\id_ed25519"
-Icacls $Key /c /t /Inheritance:d
-Icacls $Key /c /t /Grant ${env:UserName}:F
-TakeOwn /F $Key
-Icacls $Key /c /t /Grant:r ${env:UserName}:F
-Icacls $Key /c /t /Remove:g Administrator "Authenticated Users" BUILTIN\Administrators BUILTIN Everyone System Users
-# verify
-Icacls $Key
-Remove-Variable -Name Key
-----
-+
-. Display the public key contents:
-+
-[source,bash]
-----
-type $env:UserProfile\.ssh\id_ed25519.pub
-----
-+
-. Highlight the contents of the public key and copy it to your clipboard.
-. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Clusters" > "Anvil Shell Access".
-. Once presented with a terminal, run the following.
-+
-[source,bash]
-----
-mkdir ~/.ssh
-vim ~/.ssh/authorized_keys
-
-# press "i" (for insert) then paste the contents of your public key on a newline
-# then press Esc, and type ":wq" to save and quit
-
-# set the permissions
-chmod 700 ~/.ssh
-chmod 644 ~/.ssh/authorized_keys
-chmod 644 ~/.ssh/known_hosts
-chmod 644 ~/.ssh/config
-chmod 600 ~/.ssh/id_ed25519
-chmod 644 ~/.ssh/id_ed25519.pub
-----
-. Now, confirm that it works by opening a command prompt on your local machine and typing the following.
-+
-[source,bash]
-----
-ssh username@anvil.rcac.purdue.edu
-----
-+
-. Be sure to replace "username" with your _Anvil_ username, for example "x-kamstut".
-. Upon success, you should be immediately connected to Anvil _without_ typing a password -- cool!
-
-For this question, just include a sentence in a markdown cell stating whether or not you were able to get this working. If it is not working, the next question won't work either, so please post in Piazza for someone to help!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Finally, let's install the "Remote Explorer" or "Remote SSH" extension in VS Code. This extension will allow us to connect to Anvil from VS Code and develop on Anvil from our local machine. Once installed, click on the icon on the left-hand side of VS Code that looks like a computer screen.
-
-In the new menu on the left, click the little settings cog. Select the first option, which should be either `/Users/username/.ssh/config` (if on a mac) or `C:\Users\username\.ssh\config` (if on windows). This will open a file in VS Code. Add the following to the file:
-
-.mac config
-----
-Host anvil
- HostName anvil.rcac.purdue.edu
- User username
- IdentityFile ~/.ssh/id_ed25519
-----
-
-.windows config
-----
-Host anvil
- HostName anvil.rcac.purdue.edu
- User username
- IdentityFile C:\Users\username\.ssh\id_ed25519
-----
-
-Save the file and close out of it. Now, if all is well, you will see an "anvil" option under the "SSH TARGETS" menu. Right click on "anvil" and click "Connect to Host in Current Window". Wow! You will now be connected to Anvil! Try opening a file -- notice how the files are the files you have on Anvil -- that is super cool!
-
-Open a terminal in VS Code by pressing `Cmd+Shift+P` (or `Ctrl+Shift+P` on Windows) and typing "terminal". You should see a "Terminal: Create new terminal" option appear. Select it and you should notice a terminal opening at the bottom of your VS Code window. That terminal is on Anvil too! Way cool! Run the API by running the following in the new terminal:
-
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module load tdm
-module load python/f2022-s2023
-cd $HOME/hithere
-python3 -m uvicorn imdb.api:app --reload --port 50087
-----
-
-If you are prompted about port forwarding, allow it. In addition, open up a browser on your own computer and test out the following links: `localhost:50087` and `localhost:50087/hithere/bob`. Wow! VS Code even takes care of forwarding ports so you can access the API from the comfort of your own computer and browser! This will be extremely useful for the rest of the semester!
-
-For this question, submit a couple of screenshots demonstrating opening code on Anvil from VS Code on your local computer, and accessing the API from your local browser.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-There are tons of cool extensions and themes in VS Code. Go ahead and apply a new theme you like and download some extensions.
-
-For this question, submit a screenshot of your tricked out VS Code setup with some Python code open. Have some fun!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project12.adoc
deleted file mode 100644
index 0762af19a..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project12.adoc
+++ /dev/null
@@ -1,263 +0,0 @@
-= TDM 30100: Project 12 -- 2022
-
-**Motivation:** RESTful APIs are everywhere! At some point in time, it will likely be the source of some data that you will want to use. What better way to understand how to interact with APIs than building your own?
-
-**Context:** This is the second to last project in a series around APIs. In this project, we will build a minimal API that does some basic operations, and in the following project we will build on top of that API and use _templates_ to build a "frontend" for our API.
-
-**Scope:** Python, fastapi, VSCode
-
-.Learning Objectives
-****
-- Understand and use the HTTP methods with the `requests` library.
-- Differentiate between graphql, REST APIs, and gRPC.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-== Questions
-
-=== Question 1
-
-Let's start by setting up our API, and getting a few things configured. This project will assume that you were able to connect and setup VSCode in the previous project. If you didn't do this, please go back and do that now. That project (aside from some initially incorrect windows commands) is pretty straightforward and at the end you have a super cool setup and easy way to work on Anvil using VSCode!
-
-. Open VSCode and connect to Anvil.
-. Hold Cmd+Shift+P (or Ctrl+Shift+P) to open the command palette, search for "Terminal: Create new terminal" and hit enter. This will open a terminal in VSCode that is connected to Anvil.
-. Copy over our project template into your `$HOME` directory.
-+
-[source,bash]
-----
-cp -r /anvil/projects/tdm/etc/imdb $HOME
-----
-+
-. Open the `$HOME/imdb` directory in VSCode.
-. Load up our `f2022-s2023` Python environment by running the following in the terminal in VSCode.
-+
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module load tdm
-module load python/f2022-s2023
-----
-+
-. Go ahead and test out the provided, minimal API by running the following in the terminal in VSCode.
-+
-[source,bash]
-----
-find_port # returns a port, like 7777
-----
-+
-[source,bash]
-----
-python3 -m uvicorn imdb.api:app --reload --port 7777 # replace 7777 with the port you got from find_port
-----
-+
-. Open a browser on your computer and navigate to `localhost:7777`, but be sure to replace 7777 with the port you got from `find_port`. You should see a message that says "Hello World!". This is a JSON response, which is why your browser is showing it in a nice format.
-
-No need to turn anything in for this question -- it is integral for all of the remaining questions.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-If you check out `api.py` you will find two functions. The `root` function is responsible for the "Hello World" message you received in the previous question. As you can see, we returned the JSON response, which caused the data to be rendered that way. JSON is a data format that is _very_ common in RESTful APIs. For simplicity, we will be using JSON for all of our responses today.
-
-The other function, `read_item`, is responsible for the "hi there" stuff from the previous project. If you navigate to `localhost:7777/hithere/alice` you will see a webpage that displays the name "alice". The URL path parameter `{name}` turns into a variable. So if you changed `alice` to `joe`, you would see "hi there joe" instead. This is a very common pattern in RESTful APIs.
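-
-As a point of reference, a path parameter endpoint in FastAPI generally looks something like the following minimal sketch. This is a simplified, self-contained example that returns JSON; it is not the exact contents of the provided `read_item` function, which renders a template instead.
-
-[source,python]
-----
-from fastapi import FastAPI
-
-app = FastAPI()
-
-@app.get("/hithere/{name}")
-async def read_item(name: str):
-    # whatever appears in the {name} portion of the URL is passed in as the `name` argument
-    return {"message": f"hi there {name}"}
-----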
-
-We are going to keep things as "simple" as possible, because there are so many components to this sort of project that it is easy to get confused and have something go wrong. Our goal is to wire our API up to the database, `imdb.db`, and create an endpoint that returns some structured data (as JSON).
-
-It is _highly_ recommended to go through the https://fastapi.tiangolo.com/tutorial/[official documentation tour]. It is well written and may provide examples that help you better understand something we do in this project.
-
-Let's start with our problem statement. We want links like the following to display data about the given title: `localhost:7777/title/tt8946378`, where `tt8946378` is the imdb.com title id for "Knives Out". Specifically, we want to start by displaying the `primary_title`, `premiered`, and `runtime_minutes` from the `titles` table.
-
-Before we can even think about displaying the data, we need to wire up our database. Create a new file in the `imdb` directory called `database.py`. Include the following content.
-
-[source,python]
-----
-import os
-import aiosql
-import sqlite3
-from dotenv import load_dotenv
-from pathlib import Path
-
-load_dotenv()
-
-database_path = Path(os.getenv("DATABASE_PATH"))
-queries = aiosql.from_path(Path(__file__).parents[0] / "queries.sql", "sqlite3")
-----
-
-We are going to use the https://nackjicholson.github.io/aiosql/[`aiosql`] package to make queries to our database. This package is extremely simple (compared to other packages) and has (in my opinion) the best separation of SQL and Python code. It is also very easy to use (compared to other packages, at least). Let's walk through the code.
-
-. `load_dotenv()` loads the environment variables from a `.env` file. Classically, the `.env` file is used to store sensitive credentials, like database passwords. In our case, our database has no password, so to demonstrate we are going to put our database path in an environment variable instead.
-+
-[IMPORTANT]
-====
-We haven't created a `.env` file yet, let's do that now! Create a text file named `.env` in your root directory (the outer `imdb` folder) and add the following contents:
-
-.env
-----
-DATABASE_PATH=/anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-Now, after `load_dotenv()` is called, the `os.getenv("DATABASE_PATH")` will return the path to our database, `/anvil/projects/tdm/data/movies_and_tv/imdb.db`.
-====
-+
-. `database_path` is simply the path loaded into a variable.
-. `queries` is an object that loads up all of our SQL queries from a future `queries.sql` file and allows us to easily make SQL queries from inside Python. We will give an example of this later.
-
-That's it! We can then import the `queries` object in our other Python modules in order to make queries, cool!
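-
-For example, a sketch of how another module might use the `queries` object is below. The query name `get_name_and_age` and its `myid` parameter are placeholders here; we will write real queries in `queries.sql` in the next question.
-
-[source,python]
-----
-import sqlite3
-
-from imdb.database import database_path, queries
-
-def lookup(myid: int):
-    # connect to the sqlite database at the path loaded from the .env file
-    conn = sqlite3.connect(database_path)
-    with conn as c:
-        # call the (placeholder) query named get_name_and_age defined in queries.sql
-        results = queries.get_name_and_age(c, myid=myid)
-    conn.close()
-    return results
-----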
-
-No need to submit anything for this question either. The `database.py` file will be submitted at the end.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Okay, we "wired" our database up, but we need to actually make a query that returns all of the information we want to display, right?
-
-Create a new file called `queries.sql` in the inner `imdb` directory. This file will contain all of our SQL queries. The "comments" inside this file are critical for our `aiosql` package to identify the queries and load them into our `queries` object. The following is an example of a `queries.sql` file and Python code that uses it to make queries on a fake database.
-
-.queries.sql
-----
--- name: get_name_and_age
--- Get name and age of object.
-SELECT name, age FROM my_table WHERE myid = :myid;
-----
-
-[source,python]
-----
-import sqlite3
-import aiosql
-
-conn = sqlite3.connect("fake.db")
-queries = aiosql.from_path("queries.sql", "sqlite3")
-results = queries.get_name_and_age(conn, myid=1)
-conn.close()
-print(results)
-
-# or, the following, which automatically commits (or rolls back) the transaction
-
-queries = aiosql.from_path("queries.sql", "sqlite3")
-conn = sqlite3.connect("fake.db")
-with conn as c:
-    results = queries.get_name_and_age(c, myid=1)
-
-print(results)
-----
-
-.output
-----
-[("bob", 42), ("alice", 37)]
-----
-
-Add a query called `get_title` to your `queries.sql` file. This query should return the `primary_title`, `premiered`, and `runtime_minutes` from the `titles` table.
-
-In your `api.py` file, add a new function that will be used to eventually return a JSON with the title information. Call this function `get_title`.
-
-For now, just use the `queries` object to make a query to the database, and have the function return whatever the query returns. Once implemented, test it out by navigating to `localhost:7777/title/tt8946378` in your browser. You should see an (incorrectly) rendered response, with the info we wanted to display. We are getting there!
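-
-One possible shape for this endpoint is sketched below. It assumes your query is named `get_title` and takes a `title_id` parameter, and it is written as a self-contained example rather than the exact code to drop into the provided `api.py` (which already defines `app`).
-
-[source,python]
-----
-import sqlite3
-
-from fastapi import FastAPI
-
-from imdb.database import database_path, queries
-
-app = FastAPI()
-
-@app.get("/title/{title_id}")
-async def get_title(title_id: str):
-    # run the get_title query from queries.sql and return whatever it gives back
-    conn = sqlite3.connect(database_path)
-    with conn as c:
-        results = queries.get_title(c, title_id=title_id)
-    conn.close()
-    return results
-----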
-
-For this question, include a screenshot like the following, but for a different title.
-
-image::figure37.webp[Example output, width=792, height=500, loading=lazy, title="Example output"]
-
-[NOTE]
-====
-If you use Chrome, your screenshot may look a bit different, that is OK.
-====
-
-[TIP]
-====
-The `read_item` is very similar, just more complicated than our `get_title` function. You can use it as a reference.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Okay! We were able to display our data, but it is not formatted correctly, and without any context, it is hard to say what 130 represents (runtime in minutes). Let's fix that by using the `pydantic` package to create a `Title` model. This model will be used to format our data before it is returned to the user. It is good practice to have all _responses_ be formatted using `pydantic` -- that way data is always returned in a consistent, expected format.
-
-Read https://fastapi.tiangolo.com/tutorial/sql-databases/?h=pydantic#create-pydantic-models-schemas-for-reading-returning[this] section of the official documentation.
-
-Create a new file called `schemas.py` in the `imdb` directory. In this file, create a `Title` model that has all of the fields we want to display.
-
-In your `api.py` file, update your `get_title` function to return a `Title` object instead of the raw data from the database.
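-
-A minimal `schemas.py` could look something like the sketch below. The field names follow the columns we are returning; the exact types (and whether any field should be optional) are assumptions you may need to adjust.
-
-[source,python]
-----
-from typing import Optional
-
-from pydantic import BaseModel
-
-class Title(BaseModel):
-    # the fields we want to display for a title
-    primary_title: str
-    premiered: Optional[int] = None
-    runtime_minutes: Optional[int] = None
-----
-
-With a model like this in place, the conversion pattern in the tip below can be used to turn the raw query result into a `Title` object.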
-
-[TIP]
-====
-To take a query result and convert it to a `pydantic` model, do the following (for example).
-
-[source,python]
-----
-queries = aiosql.from_path("queries.sql", "sqlite3")
-conn = sqlite3.connect("fake.db")
-with conn as c:
-    results = queries.get_name(c, myid=1)
-
-results = {key: results[0][i] for i, key in enumerate(MyModel.__fields__.keys())}
-my_model = MyModel(**results)
-----
-====
-
-Navigate to `localhost:7777/title/tt8946378` in your browser. You should see a correctly formatted response, with the info we wanted to display. Your result should look like the following image, but for a different title.
-
-image::figure38.webp[Example output, width=792, height=500, loading=lazy, title="Example output"]
-
-Please submit the following things for this project.
-
-- A `.ipynb` file with a screenshot for question 3 and 4 added.
-- Your `api.py` file.
-- Your `database.py` file.
-- Your `queries.sql` file.
-- Your `schemas.py` file.
-
-Congratulations! You should feel accomplished! While it may not _feel_ like you did much, you wired together a database and backend API, made SQL queries from within Python, and formatted your data using `pydantic` models. That is a lot of work! Great job! Happy Thanksgiving!
-
-[IMPORTANT]
-====
-If you have any questions, please post in Piazza and we will do our best to help you out!
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5 (optional, 0 points)
-
-Read https://fastapi.tiangolo.com/tutorial/sql-databases/?h=pydantic#__tabbed_2_3[the documentation] and update your API to include the `genres` in your response!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:projects:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project13.adoc
deleted file mode 100644
index 50107bdc6..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project13.adoc
+++ /dev/null
@@ -1,196 +0,0 @@
-= TDM 30100: Project 13 -- 2022
-
-**Motivation:** RESTful APIs are everywhere! At some point in time, it will likely be the source of some data that you will want to use. What better way to understand how to interact with APIs than building your own?
-
-**Context:** This is the last project in a series around APIs. In this project, we will use templates and `jinja2` to build a "frontend" for our API.
-
-**Scope:** Python, fastapi, VSCode
-
-.Learning Objectives
-****
-- Understand and use the HTTP methods with the `requests` library.
-- Differentiate between graphql, REST APIs, and gRPC.
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client.
-- Identify the various components of a URL.
-- Use a templating engine and HTML to display data from our API.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-In addition, the following is an illustration of the database to help you understand the data.
-
-image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"]
-
-For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
-
-== Questions
-
-=== Question 1
-
-For this project, we've provided you with a ready-made API, very similar to the results of the previous project, but with a bit more to work with. You'll be relieved to hear, however, that you will be primarily working with a discrete set of HTML template files, and not much else.
-
-Start this project just like you did in the previous project.
-
-. Open VSCode and connect to Anvil.
-. Hold Cmd+Shift+P (or Ctrl+Shift+P) to open the command palette, search for "Terminal: Create new terminal" and hit enter. This will open a terminal in VSCode that is connected to Anvil.
-. Copy over our project template into your `$HOME` directory.
-+
-[source,bash]
-----
-cp -r /anvil/projects/tdm/etc/imdb2 $HOME
-----
-+
-. Open the `$HOME/imdb2` directory in VSCode.
-. Load up our `f2022-s2023` Python environment by running the following in the terminal in VSCode.
-+
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module load tdm
-module load python/f2022-s2023
-----
-+
-. Go ahead and test out the provided, minimal API by running the following in the terminal in VSCode.
-+
-[source,bash]
-----
-find_port # returns a port, like 7777
-----
-+
-[source,bash]
-----
-python3 -m uvicorn imdb.api:app --reload --port 7777 # replace 7777 with the port you got from find_port
-----
-+
-. Open a browser on your computer and navigate to `localhost:7777`, but be sure to replace 7777 with the port you got from `find_port`. You should see a message that says "Hello World!". This is a JSON response, which is why your browser is showing it in a nice format.
-
-Finally, check out the new code base and the following new endpoints.
-
-- `localhost:7777/api/titles/tt4236770`
-- `localhost:7777/api/cast/tt4236770`
-- `localhost:7777/api/person/nm0000126`
-
-Like before, these endpoints all return appropriately formatted JSON objects. Within our Python code, we have nice `pydantic` objects to work with. However, we want to display all of this data in a nice, human-readable format. This is often referred to as a _frontend_. Oftentimes a frontend will use a completely different set of technologies, and simply use the API to fetch specially structured data. In this case, we are going to use `fastapi` to build our frontend. We can do this by using a templating engine (built into `fastapi`), and in this case, we will be using `jinja2`.
-
-For this question, you don't need to submit anything, as you'll need to have all of it working to continue.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Check out the one and only template provided in the `templates` directory, `hithere.html`. When navigating to `localhost:7777/hithere/alice` you'll be greeted with a message saying "Hi there: alice!".
-
-[IMPORTANT]
-====
-You will need to update the `hithere` function URL to use the port you are using instead of port 7777.
-====
-
-The content of the template is simple.
-
-[source,html]
-----
-<html>
-<head>
-    <title>Hi there!</title>
-</head>
-<body>
-    <h1>Hi there: {{ name.my_name }}!</h1>
-</body>
-</html>
-----
-
-When you navigate to `localhost:7777/hithere/alice`, `fastapi` sends a request to our API endpoint `localhost:7777/api/hithere/alice` and sends the response to our template, `hithere.html`. The template can then access the name by surrounding the variable with double curly braces and dot notation.
-
-This whole process emulates what a regular frontend would do. First make a request to get the data (in our case, in JSON format), then pass the response to some sort of frontend system (in our case a template engine that chooses how to display the data).
-
-Let's start by creating a single new HTML template called `title.html`. This template will be used to display the information about a single title. The template should be located in the `templates` directory. Let's start the template with a basic HTML skeleton.
-
-[source,html]
-<html>
-<head>
-    <title>Title</title>
-</head>
-<body>
-
-</body>
-</html>
-----
-
-Create a new endpoint in `api.py`: `localhost:7777/titles/{some_title_id}`. This endpoint should behave similarly to the `hithere` function. It should first make a request to our api, `localhost:7777/api/titles/{some_title_id}`, and then pass the response along to the `title.html` template.
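-
-A sketch of what this endpoint could look like is below. It assumes the `templates` object (a `Jinja2Templates` instance) and the `app` object already defined in the provided `api.py`, uses the `requests` library for the internal call, and hardcodes port 7777; swap in the port your API is actually running on. The context key `title` is just a name chosen for this example, so reference the same name inside `title.html`.
-
-[source,python]
-----
-import requests
-from fastapi import Request
-
-@app.get("/titles/{title_id}")
-async def title_page(request: Request, title_id: str):
-    # fetch the JSON from our own API endpoint, then hand it to the jinja2 template
-    resp = requests.get(f"http://localhost:7777/api/titles/{title_id}")
-    return templates.TemplateResponse("title.html", {"request": request, "title": resp.json()})
-----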
-
-Once complete, go back to your `title.html` template, and modify it so it displays the `primary_title` in an `h1` tag. In addition, display the rest of the data _except_ the `genres`. You can choose how to display, or rather, what HTML tags to use to display the remaining data.
-
-Test it out by navigating to: `localhost:7777/titles/tt4236770`.
-
-[TIP]
-====
-Check out this post for examples on accessing data, using conditionals (if/else), and loops in `jinja2`.
-
-https://realpython.com/primer-on-jinja-templating/#get-started-with-jinja
-====
-
-.Items to submit
-====
-- A screenshot displaying the webpage for `localhost:7777/titles/tt4236770`.
-====
-
-=== Question 3
-
-In the previous question, you learned how to take a request and modify the template to display the structured data returned from the request (the response) using `jinja2` templating.
-
-In the previous question, you displayed data for a title _except_ for the genre data. The genre data is a list of strings. To access the genres from within a `jinja2` template, you will need to loop through the genres and display them. See https://realpython.com/primer-on-jinja-templating/#leverage-for-loops[this] article for an example. _How_ you decide to display the data (what HTML tags to use) is up to you!
-
-.Items to submit
-====
-- A screenshot displaying the webpage for `localhost:7777/titles/tt4236770`.
-====
-
-=== Question 4
-
-Practice makes perfect. Create a new template called `person.html`. As you may guess, we want this template to display the name of the person of interest, and a list of the `primary_title` for all of their works. Create a new endpoint at `localhost:7777/person/{some_person_id}`. This endpoint should first make a request to our api at `localhost:7777/api/person/{some_person_id}` and then pass the response along to the `person.html` template.
-
-How you display the data is up to you. I displayed the name of the person in a big h1 tag and listed all of the `primary_title` data in a list of p tags. It doesn't need to be pretty!
-
-.Items to submit
-====
-- A screenshot displaying the webpage for `localhost:7777/person/nm0000126`.
-====
-
-=== Question 5
-
-Create a new template called `cast.html`. As you may guess, we want this template to display the cast for a given title. Create a new endpoint at `localhost:7777/cast/{some_title_id}`. This endpoint should first make a request to our api at `localhost:7777/api/cast/{some_title_id}` and then pass the response along to the `cast.html` template.
-
-This should be _extremely_ similar to question (3)! Please have a nice h1 header with the name of the title, and a list of cast members. We are only going to include 1 small twist. For every cast member name you display, make the cast member name itself be a link that links back to the person's page (created in the previous question). This way, when you navigate to `localhost:7777/cast/tt4236770`, you can click on any of the cast member names and be taken to their page. Very cool!
-
-.Items to submit
-====
-- A screenshot displaying the webpage for `http://localhost:7777/cast/tt4236770`.
-- A screenshot displaying the webpage for one of the cast members (someone other than Kevin Costner).
-====
-
-=== Question 6 (optional, 0 points)
-
-Update the `title.html` template so that the primary title is displayed in green if the rating of the title is 8.0 or higher, and red otherwise.
-
-.Items to submit
-====
-- A screenshot displaying an instance where the page is displayed in green and an instance where the page is displayed in red.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:projects:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-projects.adoc
deleted file mode 100644
index 7050efaa8..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-projects.adoc
+++ /dev/null
@@ -1,41 +0,0 @@
-= TDM 30100
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-[%header,format=csv,stripes=even,%autowidth.stretch]
-|===
-include::ROOT:example$30100-2022-projects.csv[]
-|===
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2022/tdm30100[https://piazza.com/purdue/fall2022/tdm30100]
-
-=== Link
-
-https://piazza.com/purdue/fall2022/tdm30100/home[https://piazza.com/purdue/fall2022/tdm30100/home]
-
-== Syllabus
-
-See xref:fall2022/logistics/syllabus.adoc[here].
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project01.adoc
deleted file mode 100644
index 98e34cbef..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project01.adoc
+++ /dev/null
@@ -1,311 +0,0 @@
-= TDM 40100: Project 1 -- 2022
-
-**Motivation:** It's been a long summer! Last year, you got some exposure to command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python _with_ data. Topics will include things like: documentation using tools like sphinx or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be wrestling with data the entire time.
-
-We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review a few basics, and prepare for the rest of the semester.
-
-**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about a variety of useful and exciting topics.
-
-**Scope:** Jupyter Lab, R, Python, Anvil, markdown, lmod
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Anvil.
-- Review, mess around with `lmod`.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/1991.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-
-== Questions
-
-=== Question 1
-
-For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster].
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters".
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer.
-
-[NOTE]
-====
-Last year, we used the https://www.rcac.purdue.edu/compute/brown[Brown computing cluster]. Compare the specs of https://www.rcac.purdue.edu/compute/anvil[Anvil] and https://www.rcac.purdue.edu/compute/brown[Brown] -- which one is more powerful?
-====
-
-.Items to submit
-====
-- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil.
-- A sentence explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Like the previous year, we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate to and log in to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen with lots of options. Don't worry, however, the next steps are very straightforward.
-
-[TIP]
-====
-If you did not (yet) set up your 2-factor authentication credentials with Duo, you can go back to Step 9 and set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup
-====
-
-Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook].
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 MB of memory.
-
-[NOTE]
-====
-It is OK to not understand what that means yet; we will learn more about this in TDM 30100. For the curious, however, if you were to open a terminal session in Anvil and run the following, you would see your job queued up.
-
-[source,bash]
-----
-squeue -u username # replace 'username' with your username
-----
-====
-
-[NOTE]
-====
-If you select 4000 MB of memory instead of 3800 MB, you will end up getting 3 CPU cores instead of 2. OnDemand tries to keep the memory-to-CPU ratio at _about_ 1900 MB per CPU core.
-====
-
-We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine.
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-f2022-s2023::
-The course kernel, where Python code runs without any extra work, and where you also have the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-f2022-s2023-r::
-An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you.
-
-[NOTE]
-====
-Soon, we'll have the f2022-s2023-r kernel available and ready to use!
-====
-
-Test it out! Run the following code in a new cell. This code prints the hostname of the machine, revealing which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-.Items to submit
-====
-- Code used to solve this problem in a "code" cell.
-- Output from running the code (the name of the node on Anvil that you are running on).
-====
-
-=== Question 3
-
-
-In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know!
-
-Practice running the following examples.
-
-python::
-[source,python]
-----
-my_list = [1, 2, 3]
-print(f'My list is: {my_list}')
-----
-
-SQL::
-[source, sql]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-[source, ipython]
-----
-%%sql
-
-SELECT * FROM titles LIMIT 5;
-----
-
-[NOTE]
-====
-In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements!
-
-[source,ipython]
-----
-%load_ext sql
-----
-====
-
-bash::
-[source,bash]
-----
-%%bash
-
-awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv
-----
-
-[TIP]
-====
-To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`).
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code you can run), and markdown cells (which contain markdown text that you can render into nicely formatted text). How many cells of each type are there in this template by default?
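-
-If you would rather count programmatically, a notebook file is just JSON, so a small sketch like the following (using the template's Anvil path mentioned above) tallies the cell types for you.
-
-[source,python]
-----
-import json
-from collections import Counter
-
-# read the notebook file (plain JSON) and count each cell type
-with open("/anvil/projects/tdm/etc/project_template.ipynb") as f:
-    nb = json.load(f)
-
-print(Counter(cell["cell_type"] for cell in nb["cells"]))
-----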
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-.Items to submit
-====
-- How many cells of each type are there in the default template?
-====
-
-=== Question 5
-
-Make a markdown cell containing a list of every topic and/or tool you wish was taught in The Data Mine -- in order of _most_ interested to _least_ interested.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 6
-
-Review your Python, R, and bash skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Your Python and R solutions should each include at least 1 custom function, and at least 1 graphic output.
-
-[NOTE]
-====
-Your `bash` solution does not need a plot or a custom function.
-====
-
-Make sure your code is complete and well-commented. For each language, include a markdown cell with your short analysis (1 sentence is fine).
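-
-As a rough sketch of the expected structure for the Python portion (not a full solution), something like the following would satisfy the custom function and graphic requirements; the `Distance` column name is assumed from the flights dataset used earlier in this project, so adjust it to whatever dataset you pick.
-
-[source,python]
-----
-import pandas as pd
-import matplotlib.pyplot as plt
-
-def summarize(series):
-    """Custom function: return a few basic summary statistics."""
-    return series.describe()
-
-# only read the one column we need, to keep memory usage low
-df = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1991.csv", usecols=["Distance"])
-print(summarize(df["Distance"]))
-
-# at least one graphic output
-df["Distance"].plot.hist(bins=50)
-plt.show()
-----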
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 7
-
-
-The module system, `lmod`, is extremely popular on HPC (high performance computing) systems. Anvil is no exception!
-
-In a terminal, take a look at the modules available to you by default.
-
-[source,bash]
-----
-module avail
-----
-
-Notice that at the very top, there is a section named `/anvil/projects/tdm/opt/lmod`.
-
-Now run the following.
-
-[source,bash]
-----
-module reset
-module avail
-----
-
-Notice how the set of available modules changes! By default, we have it loaded up with some Data Mine-specific modules. To manually load up those modules, run the following.
-
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module avail
-----
-
-Notice how at the very top, there is a new section named `/anvil/projects/tdm/opt/core` with a single option, `tdm/default`.
-
-Go ahead and load up `tdm/default`.
-
-[source,bash]
-----
-module load tdm
-module avail
-----
-
-It looks like we are (pretty much) back to where we started off! This is useful to know in case there is ever a situation where you'd like to SSH into Anvil and load up our version of Python with the packages we have ready-made for you to use.
-
-To finish off this "question", run the following and make a note in your notebook of what the result is.
-
-[source,bash]
-----
-which python3
-----
-
-Okay, now, load up our `python/f2022-s2023` module and run `which python3` once again. What is the result? Surprised by the result? Any ideas what this is doing? If you are curious, feel free to ask in Piazza! Otherwise, congratulations, you've made it through the first project!
-
-.Items to submit
-====
-- `which python3` before and after loading the `python/f2022-s2023` module.
-- Any other commentary you'd like to include.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you _think_ you submitted was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project02.adoc
deleted file mode 100644
index 06dbde988..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project02.adoc
+++ /dev/null
@@ -1,231 +0,0 @@
-= TDM 40100: Project 2 -- 2022
-
-**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years.
-
-**Context:** In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine.
-
-**Scope:** `sqlite3`, lmod, SQL
-
-.Learning Objectives
-****
-- Create your own `sqlite3` database file.
-- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data.
-- Insert data into your database.
-- Run one or more queries to test out the end result.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and review the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-== Questions
-
-=== Question 1
-
-The goodreads dataset has a variety of files, which you can find in `/anvil/projects/tdm/data/goodreads/original`. With that being said, there are 4 files which hold the bulk of the data. The rest are _mostly_ derivatives of those 4 files.
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-Take a look at the 4 files included in this dataset. How many bytes of data total do the 4 files take up on the filesystem?
-
-[TIP]
-====
-You can use `du` in a `bash` cell to get this information.
-====
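-
-Equivalently, if you prefer to stay in Python, a short sketch like the following adds up the sizes reported by the filesystem.
-
-[source,python]
-----
-import os
-
-files = [
-    "/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json",
-    "/anvil/projects/tdm/data/goodreads/goodreads_book_series.json",
-    "/anvil/projects/tdm/data/goodreads/goodreads_books.json",
-    "/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json",
-]
-print(sum(os.path.getsize(f) for f in files), "bytes total")
-----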
-
-_Approximately_ how many books and how many reviews are included in the datasets?
-
-Finally, take a look at the first book.
-
-----
-{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin's Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "title_without_series": "W.C. Fields: A Life on Film"}
-----
-
-As you can see, there is an `image_url` included for each book. Use `bash` tools to download one of the images to `$HOME/p02output`. How much space does it take up (in bytes)?
-
-[TIP]
-====
-Use `wget` to download the image. Rather than using `cd` to first navigate to `$HOME/p02output` before running `wget`, use a `wget` _option_ to specify the directory to download to.
-====
-
-[NOTE]
-====
-It is okay to manually copy/paste the link from the json.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Now we want to download more than 1 image in order to approximate how much space the images take up, on average.
-
-[IMPORTANT]
-====
-In the previous question we said it was okay to manually copy/paste the `image_url` -- this time, you _probably_ won't want to do that. You can use a `bash` tool called `jq` to extract the links automatically. `jq` is located at `/anvil/projects/tdm/bin/jq`.
-
-The `--raw-output` option to `jq` will be useful as well.
-====
-
-Use `bash` tools (and only `bash` tools, from within a `bash` cell) to download 25 **random** book images to `$HOME/p02output`, and calculate the average amount of space that each image takes up. Use that information to estimate how much space it would take to store the images for all of the books in the dataset.
-
-[TIP]
-====
-Take a look at the `shuf` command in `bash`: `man shuf`.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Okay, so _roughly_, in total, we are looking at around 34 GB of data. With that much data it will _definitely_ be useful for us to create a database. After all, answering questions like the following is not very straightforward if we handed you this data and said "get that info please":
-
-- What is the average rating of Brandon Sanderson's books?
-- What are the titles of the 5 books with the most ratings?
-
-_But_ if we had a nice `sqlite` database, it would be easy! So let's start planning this out.
-
-First, before we do that, it would make sense to get a sample of each of the datasets. Working with samples just makes it a lot easier to load the data up and parse through it.
-
-Use `shuf` to get a random sample of the `goodreads_books.json` and `goodreads_reviews_dedup.json` datasets. Approximate how many rows you'd need in order to get the datasets down to around 100 MB each, and do so. Put the samples, and copies of `goodreads_book_authors.json` and `goodreads_book_series.json`, in `$HOME/goodreads_samples`.
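-
-One way to approximate the row count is to estimate the average line size from a small sample and divide. A rough Python sketch of that estimation logic is below; the actual sampling should still be done with `shuf`, as described above.
-
-[source,python]
-----
-# estimate how many lines of goodreads_books.json add up to roughly 100 MB
-path = "/anvil/projects/tdm/data/goodreads/goodreads_books.json"
-
-sample_size = 1000
-sample_bytes = 0
-with open(path, "rb") as f:
-    for i, line in enumerate(f):
-        if i >= sample_size:
-            break
-        sample_bytes += len(line)
-
-avg_bytes_per_line = sample_bytes / sample_size
-print(f"roughly {100 * 1024**2 / avg_bytes_per_line:.0f} lines per 100 MB")
-----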
-
-[NOTE]
-====
-It just needs to be approximately 100 MB -- no need to fuss; as long as it is within 50 MB we are good.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Check out the 5 storage classes (which you can think of as types) that `sqlite3` uses: https://www.sqlite.org/datatype3.html
-
-In a markdown cell, write out each of the keys in each of the json files, and list the appropriate storage class to use. As an example, here is what we are looking for, for `goodreads_reviews_dedup.json`.
-
-- user_id: TEXT
-- book_id: INTEGER
-- review_id: TEXT
-- rating: INTEGER
-- review_text: TEXT
-- date_added: TEXT
-- date_updated: TEXT
-- read_at: TEXT
-- started_at: TEXT
-- n_votes: INTEGER
-- n_comments: INTEGER
-
-[NOTE]
-====
-You don't need to copy/paste the solution for `goodreads_reviews_dedup.json` since we provided it for you.
-====
-
-[IMPORTANT]
-====
-You do not need to assign a type to the following keys in `goodreads_books.json`: `series`, `popular_shelves`, `similar_books`, and `authors`.
-====
-
-[TIP]
-====
-- Assume `isbn`, `asin`, `kindle_asin`, `isbn13` columns _could_ start with a leading 0.
-- Assume any column ending in `_id` could _not_ start with a leading 0.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-[WARNING]
-====
-Please include the `CREATE TABLE` statements in code cells for this question, but realize that you will have to pop open a terminal and launch `sqlite3` to complete this problem.
-
-To do so, run the following in the new terminal.
-
-[source,bash]
-----
-module use /anvil/projects/tdm/opt/core
-module load tdm
-module load sqlite/3.39.2
-
-sqlite3 my.db # this will create an empty database
-----
-
-You will then be inside a `sqlite3` session and able to run `sqlite`-specific dot commands (which you can see after running `.help`), or SQL queries.
-====
-
-For now, let's ignore the "problematic" columns in the `goodreads_books.json` dataset (`series`, `popular_shelves`, `similar_books`, and `authors`).
-
-Translate the work you did in the previous question to 4 `CREATE TABLE` statements that will be used to create your `sqlite3` database tables. Check out some examples https://www.sqlitetutorial.net/sqlite-create-table/[here]. For now, let's keep it straightforward -- ignore primary and foreign keys, and just focus on building the 4 tables with the correct types. Similarly, don't worry about any constraints like `NOT NULL` or `UNIQUE`. Name your tables: `reviews`, `books`, `series`, and `authors`.
-
-Once you've created your `CREATE TABLE` statements, create a database called `my.db` in your `$HOME` directory -- so `$HOME/my.db`. Run your `CREATE TABLE` statements, and, in your notebook, verify the database has been created properly by running the following.
-
-[source,ipython]
-----
-%sql sqlite:////home/x-kamstut/my.db # change x-kamstut to your username
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT sql FROM sqlite_master WHERE name='reviews';
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT sql FROM sqlite_master WHERE name='books';
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT sql FROM sqlite_master WHERE name='series';
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT sql FROM sqlite_master WHERE name='authors';
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you _think_ you submitted was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project03.adoc
deleted file mode 100644
index 3c2908bb7..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project03.adoc
+++ /dev/null
@@ -1,356 +0,0 @@
-= TDM 40100: Project 3 -- 2022
-
-**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years.
-
-**Context:** In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine.
-
-**Scope:** `sqlite3`, lmod, SQL
-
-.Learning Objectives
-****
-- Create your own `sqlite3` database file.
-- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data.
-- Run one or more queries to test out the end result.
-- Demonstrate the ability to normalize a series of database tables.
-- Wrangle and insert data into the database.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and review the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-== Questions
-
-In case you skipped the previous project, let's all get on the same page. Run the following code in a Jupyter notebook to create a `sqlite3` database called `my.db` in your `$HOME` directory.
-
-[source,ipython]
-----
-%%bash
-
-rm $HOME/my.db
-sqlite3 $HOME/my.db "CREATE TABLE reviews (
- user_id TEXT,
- book_id INTEGER,
- review_id TEXT,
- rating INTEGER,
- review_text TEXT,
- date_added TEXT,
- date_updated TEXT,
- read_at TEXT,
- started_at TEXT,
- n_votes INTEGER,
- n_comments INTEGER
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE books (
- isbn TEXT,
- text_reviews_count INTEGER,
- country_code TEXT,
- language_code TEXT,
- asin TEXT,
- is_ebook INTEGER,
- average_rating REAL,
- kindle_asin TEXT,
- description TEXT,
- format TEXT,
- link TEXT,
- publisher TEXT,
- num_pages INTEGER,
- publication_day INTEGER,
- isbn13 TEXT,
- publication_month INTEGER,
- edition_information TEXT,
- publication_year INTEGER,
- url TEXT,
- image_url TEXT,
- book_id TEXT,
- ratings_count INTEGER,
- work_id TEXT,
- title TEXT,
- title_without_series TEXT
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE authors (
- average_rating REAL,
- author_id INTEGER,
- text_reviews_count INTEGER,
- name TEXT,
- ratings_count INTEGER
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE series (
- numbered INTEGER,
- note TEXT,
- description TEXT,
- title TEXT,
- series_works_count INTEGER,
- series_id INTEGER,
- primary_work_count INTEGER
-);"
-----
-
-[source,ipython]
-----
-%sql sqlite:////home/x-myalias/my.db # change x-myalias to your username
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM reviews limit 5;
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM books limit 5;
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM authors limit 5;
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM series limit 5;
-----
-
-[source,ipython]
-----
-%%bash
-
-rm -rf $HOME/goodreads_samples
-mkdir $HOME/goodreads_samples
-cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
-cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
-shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
-shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json
-----
-
-=== Question 1
-
-Update your original `CREATE TABLE` statement for the `books` table to include a field that will be used to store the actual book cover images referenced by the `image_url` field. Call this new field `book_cover`. Which one of the `sqlite` storage classes did you use?
-
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Check out a line of the `goodreads_books.json` data:
-
-[source,ipython]
-----
-%%bash
-
-head -n 1 $HOME/goodreads_samples/goodreads_books.json
-----
-
-[IMPORTANT]
-====
-Don't have a `goodreads_samples` directory? Make sure you run the following.
-
-[source,ipython]
-----
-%%bash
-
-rm -rf $HOME/goodreads_samples
-mkdir $HOME/goodreads_samples
-cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
-cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
-shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
-shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json
-----
-====
-
-Recall that in the previous project, we just ignored the following fields from the `books` table: `series`, `similar_books`, `popular_shelves`, and `authors`. We did this because those fields are more complicated to deal with.
-
-Read https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description[this] article on database normalization from Microsoft. We are going to do our best to _normalize_ our tables with these previously ignored fields taken into consideration.
-
-Let's start by setting some practical naming conventions. Note that these are not critical by any stretch, but can help remove some guesswork when navigating a database with many tables and ids.
-
-. Every table's primary key should be named `id`, unless it is a composite key. For example, instead of `book_id` in the `books` table, it would make sense to call that column `id` -- "book" is implied from the table name.
-. Every table's foreign key should reference the `id` column of the foreign table and be named "foreign_table_name_id". For example, if we had a foreign key in the `books` table that referenced an author in the `authors` table, we should name that column `author_id`.
-. Keep table names plural, when possible -- for example, not the `book` table, but the `books` table.
-. Link tables or junction tables should be named after the two tables whose many-to-many relationship they represent. (We will go over this one specifically when needed, no worries.)
-
-Make the appropriate changes to the following `CREATE TABLE` statements that reflect these conventions as much as possible (for now).
-
-[source,ipython]
-----
-%%bash
-
-rm $HOME/my.db
-sqlite3 $HOME/my.db "CREATE TABLE reviews (
- user_id TEXT,
- book_id INTEGER,
- review_id TEXT,
- rating INTEGER,
- review_text TEXT,
- date_added TEXT,
- date_updated TEXT,
- read_at TEXT,
- started_at TEXT,
- n_votes INTEGER,
- n_comments INTEGER
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE books (
- isbn TEXT,
- text_reviews_count INTEGER,
- country_code TEXT,
- language_code TEXT,
- asin TEXT,
- is_ebook INTEGER,
- average_rating REAL,
- kindle_asin TEXT,
- description TEXT,
- format TEXT,
- link TEXT,
- publisher TEXT,
- num_pages INTEGER,
- publication_day INTEGER,
- isbn13 TEXT,
- publication_month INTEGER,
- edition_information TEXT,
- publication_year INTEGER,
- url TEXT,
- image_url TEXT,
- book_id TEXT,
- ratings_count INTEGER,
- work_id TEXT,
- title TEXT,
- title_without_series TEXT
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE authors (
- average_rating REAL,
- author_id INTEGER,
- text_reviews_count INTEGER,
- name TEXT,
- ratings_count INTEGER
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE series (
- numbered INTEGER,
- note TEXT,
- description TEXT,
- title TEXT,
- series_works_count INTEGER,
- series_id INTEGER,
- primary_work_count INTEGER
-);"
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-A book can have many authors, and an author can have many books. This is an example of a many-to-many relationship.
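-
-If the idea of a junction (or link) table is new, here is a small, generic illustration using two hypothetical tables; the table you are asked to build below follows the same pattern.
-
-[source,python]
-----
-import sqlite3
-
-# a hypothetical many-to-many example, purely for illustration
-conn = sqlite3.connect(":memory:")
-conn.executescript("""
-CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
-CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
-
-CREATE TABLE courses_students (
-    id INTEGER PRIMARY KEY,
-    student_id INTEGER,
-    course_id INTEGER,
-    FOREIGN KEY (student_id) REFERENCES students (id),
-    FOREIGN KEY (course_id) REFERENCES courses (id)
-);
-""")
-----
-
-Each row of `courses_students` records one pairing of a student and a course, which is how the many-to-many relationship gets represented.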
-
-We already have a `books` table and an `authors` table. Create a _junction_ or _link_ table that effectively _normalizes_ the `authors` **field** in the `books` table. Call this new table `books_authors` (see point 4 above -- this is the naming convention we want).
-
-Make sure to include your `CREATE TABLE` statement in your notebook.
-
-[TIP]
-====
-There should be 4 columns in the `books_authors` table: a primary key field, two foreign key fields, and a regular data field that is a part of the original `authors` field data in the `books` table.
-====
-
-[IMPORTANT]
-====
-Make sure to properly apply the https://www.sqlitetutorial.net/sqlite-primary-key/[primary key] and https://www.sqlitetutorial.net/sqlite-foreign-key/[foreign key] keywords.
-====
-
-Write a SQL query to find every book by the author with id 12345. It doesn't have to be perfect syntax, as long as the logic is correct. In addition, it won't be runnable; that is okay.
-
-[TIP]
-====
-You will need to use _joins_ and our junction table to perform this query.
-====
-
-Copy, paste, and update your `bash` cell with the `CREATE TABLE` statements to implement these changes. In a markdown cell, write out your SQL query.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Assume that a series can have many books and a book can be a part of many series. Perform the same operations as the previous problem (except for the query).
-
-What columns does the `books_series` table have?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-The remaining two fields that need to be dealt with are `similar_books` and `popular_shelves`. Choose _at least_ one of the two and do your best to come up with a good solution for the way we store the data. We will give hints for both below.
-
-For this question, please copy, paste, and update the `bash` cell with the `CREATE TABLE` statements. In addition, please include a markdown cell with a detailed explanation of _why_ you chose your solution, and provide at least 1 example of a query that _should_ work for your solution (like before, we are looking for logic, not syntax).
-
-**similar_books:**
-
-[TIP]
-====
-It is okay to have a link table that links rows from the same table!
-====
-
-[TIP]
-====
-There are always many ways to do the same thing. In our examples, we used link tables with their own `id` (primary key) in addition to multiple foreign keys. This provides the flexibility of later being able to add more fields to the link table, where it may even become useful all by itself.
-
-There is, however, a _technically_ better solution for a table that is simply a link table and nothing more. This would be where you have 2 columns, both foreign keys, and you create a _composite_ primary key, or a primary key that is represented by the unique combination of both foreign keys. This ensures that links are only ever represented once. Feel free to experiment with this if you want!
-====
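-
-For example, a pure link table with a composite primary key might look like the following sketch (again using hypothetical tables, not the project's).
-
-[source,python]
-----
-import sqlite3
-
-conn = sqlite3.connect(":memory:")
-conn.executescript("""
-CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT);
-CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
-
-CREATE TABLE posts_tags (
-    post_id INTEGER,
-    tag_id INTEGER,
-    PRIMARY KEY (post_id, tag_id),
-    FOREIGN KEY (post_id) REFERENCES posts (id),
-    FOREIGN KEY (tag_id) REFERENCES tags (id)
-);
-""")
-----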
-
-**popular_shelves:**
-
-[TIP]
-====
-You can create as many tables as you need.
-====
-
-[TIP]
-====
-After a bit of thinking, this one may not be too different from what you've already accomplished.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you _think_ you submitted was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project04.adoc
deleted file mode 100644
index 36d5d6ad8..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project04.adoc
+++ /dev/null
@@ -1,195 +0,0 @@
-= TDM 40100: Project 4 -- 2022
-
-**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years.
-
-**Context:** In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine.
-
-**Scope:** `sqlite3`, lmod, SQL
-
-.Learning Objectives
-****
-- Create your own `sqlite3` database file.
-- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data.
-- Run one or more queries to test out the end result.
-- Demonstrate the ability to normalize a series of database tables.
-- Wrangle and insert data into the database.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and review the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-== Questions
-
-This project is going to be a bit more open. The goal of this project is to take our dataset sample, and write Python code to insert it into our `sqlite3` database. There are a variety of ways this could be accomplished, and we will accept anything that works, with a few constraints.
-
-In the next project, we will run some experiments that will time insertion, project the time it would take to insert all of the data, adjust database settings, and, ultimately, create a final product that we can feel good about.
-
-=== Question 1
-
-As mentioned earlier -- the goal of this project is to insert the sample data into our database. Start by generating the sample data.
-
-[source,ipython]
-----
-%%bash
-
-rm -rf $HOME/goodreads_samples
-mkdir $HOME/goodreads_samples
-cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
-cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
-shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
-shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json
-----
-
-In addition, go ahead and copy our empty database that is ready for you to insert data into.
-
-[source,ipython]
-----
-%%bash
-
-rm $HOME/my.db
-cp /anvil/projects/tdm/data/goodreads/my.db $HOME
-----
-
-You can run this as many times as you need in order to get a fresh start.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Write Python code that inserts the data into your database. Here are the constraints.
-
-. You should be able to fully recover the `book_cover` image from the database. This means you'll need to handle scraping the `image_url` and converting the image to `bytes` before inserting into the database (one possible way to write the scraping helper is sketched after this list).
-+
-[TIP]
-====
-Want some help to write the scraping code? Check out https://the-examples-book.com/projects/fall2022/30100-2022-project04#question-2[this 30100 question] for more guidance.
-====
-+
-. Your functions and code should ultimately operate on a single _row_ of the datasets. For instance:
-+
-[NOTE]
-====
-[source,python]
-----
-import json
-
-with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
- for line in f:
- print(line)
- parsed = json.loads(line)
- print(f"{parsed['isbn']=}")
- print(f"{parsed['num_pages']=}")
- break
-----
-====
-+
-Here, you can see that we can take a single row and do _something_ to it. Why do we want it to work this way? This makes it easy to break our dataset into chunks and perform operations in parallel, if we so choose (and we will, but not in this project).
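-
-For the scraping piece, a minimal sketch of a download helper is shown below. The function name `scrape_image_from_url` matches the name used in the Question 3 example code, but the body here is just one reasonable way to write it, not a required implementation.
-
-[source,python]
-----
-import requests
-
-def scrape_image_from_url(url: str) -> bytes:
-    """Download the image at `url` and return its raw bytes."""
-    resp = requests.get(url)
-    resp.raise_for_status()
-    return resp.content
-----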
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Demonstrate your database works by doing the following.
-
-. Fully recover a `book_cover` and display it in your notebook.
-+
-[NOTE]
-====
-[source,ipython]
-----
-%%bash
-
-rm $HOME/test.db || true
-sqlite3 $HOME/test.db "CREATE TABLE test (
- id INTEGER PRIMARY KEY AUTOINCREMENT,
- my_blob BLOB
-);"
-----
-
-[source,python]
-----
-import shutil
-import requests
-import os
-import uuid
-import sqlite3
-
-url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
-my_bytes = scrape_image_from_url(url)
-
-# insert
-conn = sqlite3.connect('/home/x-kamstut/test.db')
-cursor = conn.cursor()
-query = f"INSERT INTO test (my_blob) VALUES (?);"
-dat = (my_bytes,)
-cursor.execute(query, dat)
-conn.commit()
-cursor.close()
-
-# retrieve
-conn = sqlite3.connect('/home/x-kamstut/test.db')
-cursor = conn.cursor()
-
-query = f"SELECT * from test where id = ?;"
-cursor.execute(query, (1,))
-record = cursor.fetchall()
-img = record[0][1]
-tmp_filename = str(uuid.uuid4())
-with open(f"{tmp_filename}.jpg", 'wb') as file:
- file.write(img)
-
-from IPython import display
-display.Image(f"{tmp_filename}.jpg")
-----
-====
-+
-. Run a simple query to `SELECT` the first 5 rows of each table.
-+
-[NOTE]
-====
-[source,ipython]
-----
-%sql sqlite:////home/my-username/my.db
-----
-
-[source,ipython]
-----
-%%sql
-
-SELECT * FROM tablename LIMIT 5;
-----
-====
-+
-[IMPORTANT]
-====
-Make sure to replace "my-username" with your Anvil username, for example, x-kamstut is mine.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you _think_ you submitted was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project05.adoc
deleted file mode 100644
index a2517c1ec..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project05.adoc
+++ /dev/null
@@ -1,316 +0,0 @@
-= TDM 40100: Project 5 -- 2022
-
-**Motivation:** Sometimes taking the time to simply do some experiments and benchmark things can be fun and beneficial. In this project, we are going to run some tests to see how various methods we try impact insertion performance with `sqlite`.
-
-**Context:** This is the next project in our "deepish" dive into `sqlite3`. Hint: it's not really a deep dive, but it's deeper than what we've covered before! https://fly.io has been doing a blog series on a truly deep dive: https://fly.io/blog/sqlite-internals-btree/, https://fly.io/blog/sqlite-internals-rollback-journal/, https://fly.io/blog/sqlite-internals-wal/, https://fly.io/blog/sqlite-virtual-machine/.
-
-**Scope:** `sqlite3`, lmod, SQL
-
-.Learning Objectives
-****
-- Learn about some of the constraints `sqlite3` has.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and review the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_books.json`
-- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json`
-
-== Questions
-
-=== Question 1
-
-The data we want to include in our `sqlite3` database is in need of wrangling prior to insertion. It is a fairly sizeable dataset -- let's start by creating our sample dataset so we can use it to estimate the amount of time it will take to create the full database.
-
-[source,python]
-----
-from pathlib import Path
-
-def split_json_to_n_parts(path_to_json: str, number_files: int, output_dir: str) -> None:
- """
- Given a str representing the absolute path to a `.json` file,
- `split_json` will split it into `number_files` `.json` files of equal size.
-
- Args:
- path_to_json: The absolute path to the `.json` file.
- number_files: The number of files to split the `.json` file into.
- output_dir: The absolute path to the directory where the split `.json`
- files are to be output.
-
- Returns:
- Nothing.
-
- Examples:
-
- This is the second test
- >>> test_json = '/anvil/projects/tdm/data/goodreads/test.json'
- >>> output_dir = f'{os.getenv("SCRATCH")}/p5testoutput'
- >>> os.mkdir(output_dir)
- >>> number_files = 2
- >>> split_json_to_n_parts(test_json, number_files, output_dir)
- >>> output_dir = Path(output_dir)
- >>> number_output_files = sum(1 for _ in output_dir.glob("*.json"))
- >>> shutil.rmtree(output_dir)
- >>> number_output_files
- 2
- """
- path_to_json = Path(path_to_json)
- num_lines = sum(1 for _ in open(path_to_json))
- group_amount = num_lines//number_files + 1
- with open(path_to_json, 'r') as f:
- part_number = 0
- writer = None
- for idx, line in enumerate(f):
- if idx % group_amount == 0:
- if writer:
- writer.close()
-
- writer = open(str(Path(output_dir) / f'{path_to_json.stem}_{part_number}.json'), 'w')
- part_number += 1
-
-            writer.write(line)
-
-    # close the last part file so its contents are flushed to disk
-    if writer:
-        writer.close()
-----
-
-[source,python]
-----
-import os
-import shutil
-
-output_dir = f'{os.getenv("HOME")}/goodreads_samples'
-shutil.rmtree(output_dir, ignore_errors=True)  # ignore_errors avoids failing if the directory does not exist yet
-os.mkdir(output_dir)
-number_files = 1
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_books.json', number_files, output_dir)
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_authors.json', number_files, output_dir)
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_series.json', number_files, output_dir)
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_reviews_dedup.json', number_files, output_dir)
-----
-
-Create the empty database.
-
-[source,ipython]
-----
-%%bash
-
-rm $HOME/my.db || true
-sqlite3 $HOME/my.db "CREATE TABLE reviews (
- id TEXT PRIMARY KEY,
- user_id TEXT,
- book_id INTEGER,
- rating INTEGER,
- review_text TEXT,
- date_added TEXT,
- date_updated TEXT,
- read_at TEXT,
- started_at TEXT,
- n_votes INTEGER,
- n_comments INTEGER
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE books (
- id INTEGER PRIMARY KEY,
- isbn TEXT,
- text_reviews_count INTEGER,
- country_code TEXT,
- language_code TEXT,
- asin TEXT,
- is_ebook INTEGER,
- average_rating REAL,
- kindle_asin TEXT,
- description TEXT,
- format TEXT,
- link TEXT,
- publisher TEXT,
- num_pages INTEGER,
- publication_day INTEGER,
- isbn13 TEXT,
- publication_month INTEGER,
- edition_information TEXT,
- publication_year INTEGER,
- url TEXT,
- image_url TEXT,
- ratings_count INTEGER,
- work_id TEXT,
- title TEXT,
- title_without_series TEXT
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE authors_books (
- id INTEGER PRIMARY KEY,
- author_id INTEGER,
- book_id INTEGER,
- role TEXT,
- FOREIGN KEY (author_id) REFERENCES authors (id),
- FOREIGN KEY (book_id) REFERENCES books (id)
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE books_series (
- id INTEGER PRIMARY KEY,
- book_id INTEGER,
- series_id INTEGER,
- FOREIGN KEY (book_id) REFERENCES books (id),
- FOREIGN KEY (series_id) REFERENCES series (id)
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE authors (
- id INTEGER PRIMARY KEY,
- average_rating REAL,
- text_reviews_count INTEGER,
- name TEXT,
- ratings_count INTEGER
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE shelves (
- id INTEGER PRIMARY KEY,
- name TEXT
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE books_shelves (
- id INTEGER PRIMARY KEY,
- shelf_id INTEGER,
- book_id INTEGER,
- count INTEGER,
- FOREIGN KEY (shelf_id) REFERENCES shelves (id),
- FOREIGN KEY (book_id) REFERENCES books (id)
-);"
-
-sqlite3 $HOME/my.db "CREATE TABLE series (
- id INTEGER PRIMARY KEY,
- numbered INTEGER,
- note TEXT,
- description TEXT,
- title TEXT,
- series_works_count INTEGER,
- primary_work_count INTEGER
-);"
-----
-
-Check out `/anvil/projects/tdm/data/goodreads/gr_insert.py`. Use the unix `time` command to execute the script and determine how long it took to run. Estimate the amount of time it would take to insert the full dataset. To run the script in a bash cell, you would do something like the following.
-
-[source,ipython]
-----
-%%bash
-
-python3 /anvil/projects/tdm/data/goodreads/gr_insert.py 0
-----
-
-The single argument indicates which files to read in. In this first example, it will process all files ending in `_0`. When we further split the data into parts, this will help us point the script at certain subsets of the data.
-
-[IMPORTANT]
-====
-To keep things simple, we are going to skip a few things that take more time: mainly scraping the images, and the `authors_books`, `books_shelves`, and `books_series` tables.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Typically, one way to speed things up is to throw more processing power at it. Let's use 2 processes instead of 1 to insert our data. Start with a fresh (empty) database, and reinsert your data, this time using `joblib` with 2 processes. What happened?
-
-[TIP]
-====
-Copy `gr_insert.py` into the same directory as your notebook. Then, the following imports will work.
-
-[source,python]
-----
-from gr_insert import insert_all
-import joblib
-from joblib import Parallel
-from joblib import delayed
-----
-====
-
-[TIP]
-====
-https://joblib.readthedocs.io/en/latest/parallel.html[This] example should help.
-====
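-
-The general `joblib` pattern looks like the sketch below. The `work` function is a hypothetical stand-in, since the arguments that `insert_all` expects are defined in `gr_insert.py` rather than shown here.
-
-[source,python]
-----
-from joblib import Parallel, delayed
-
-def work(part_number):
-    # hypothetical stand-in for the real insertion routine
-    return part_number ** 2
-
-# run the function on parts 0 and 1 using 2 processes
-results = Parallel(n_jobs=2)(delayed(work)(i) for i in range(2))
-print(results)
-----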
-
-[TIP]
-====
-To get started, split your data into parts as follows.
-
-[source,python]
-----
-import os
-import shutil
-
-output_dir = f'{os.getenv("HOME")}/goodreads_samples'
-shutil.rmtree(output_dir, ignore_errors=True)  # ignore_errors avoids failing if the directory does not exist yet
-os.mkdir(output_dir)
-number_files = 2
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_books.json', number_files, output_dir)
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_authors.json', number_files, output_dir)
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_series.json', number_files, output_dir)
-split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_reviews_dedup.json', number_files, output_dir)
-----
-====
-
-[TIP]
-====
-You should get an error talking about something being "locked".
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-`sqlite3`, by default, can only have 1 writer at a time. So even though we have two processes trying to insert data, `sqlite3` can't handle both writers at once. In our case, one of the processes got a "database locked" error. That's a huge bummer, but at least we can run queries while data is being inserted, right? Let's give it a try.
-
-Start with a fresh database. Run the following command in a bash cell. This will spawn two processes that will try to connect to the database at the same time. The first process will be inserting data (like before). The second process will try to continually make silly `SELECT` queries for 1 minute.
-
-[source,ipython]
-----
-%%bash
-
-python3 gr_insert.py 0 &
-python3 gr_insert.py 0 read &
-----
-
-What happens?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-As you may have figured out, no, by default you cannot both read and write data to an `sqlite3` database concurrently. However, this becomes possible once you activate the write-ahead log (WAL). Cool!
-
-Start with a fresh database again, figure out _how_ to activate the WAL, activate it, and repeat question 3. Does it work now?
-
-This is a pretty big deal, and makes `sqlite3` an excellent choice for any database that doesn't need to have fantastic, concurrent write performance. Things like blogs and other small data systems could easily be backed by `sqlite3`, no problem! It also means that if you have an application that is creating a lot of data very rapidly, it is possibly _not_ the best choice.
-
-The WAL is an actual file. Find the file. What is it named?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Read the 4 articles provided in the **context** at the top of the project. Write a short paragraph about what you learned. What was the thing you found most interesting? If you are interested, feel free to try and replicate some of the examples they demonstrate.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you _think_ you submitted was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project06.adoc
deleted file mode 100644
index 944a5508c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project06.adoc
+++ /dev/null
@@ -1,326 +0,0 @@
-= TDM 40100: Project 6 -- 2022
-
-== Looking sharp for fall break?
-
-**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this next series of projects.
-
-**Context:** We are about to dive straight into a series of projects that emphasize working with images (with other fun things mixed in). We will start out with a straightforward task, with testable results.
-
-**Scope:** Python, images
-
-.Learning Objectives
-****
-- Use `numpy` and `skimage` to process images.
-****
-
-Make sure to read about and use the template found xref:templates.adoc[here], and review the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/images/apple.jpg`
-
-== Questions
-
-=== Question 1
-
-We are going to use scikit-image to load up our image.
-
-[source,python]
-----
-from skimage import io, color
-from matplotlib.pyplot import imshow
-import numpy as np
-import hashlib
-from typing import Callable
-
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-imshow(img)
-----
-
-If you take a look at `img`, you will find it is simply a `numpy.ndarray`, currently represented by a 100x100x3 array. The first channel holds the pixels' red values, the second the green values, and the third the blue values. For example, we could make our apple greener as follows.
-
-[source,python]
-----
-test = img.copy()
-test = test + [0,100,0]
-imshow(test)
-----
-
-In this project, we are going to sharpen this image using a technique called unsharp masking. In order to do this, we will need to first change our image representation from RGB (red, green, blue) to LAB. The L in LAB stands for perceptual lightness. The a and b are used to represent the 4 unique colors of human vision: red, green, blue, and yellow. If we don't convert to this representation, our sharpening can distort the colors of our image, which is _not_ what we want. By using LAB, we can apply our transformation to _just_ the lightness (the L), and our colors won't appear distorted or changed.
-
-The following is an example of converting to LAB.
-
-[source,python]
-----
-img = color.rgb2lab(img)
-----
-
-To convert back, you can use the following.
-
-[source,python]
-----
-img = color.lab2rgb(img)*255
-----
-
-The reason for the 255 is that during the conversion the values are rescaled to between 0 and 1, and we want them to be between 0 and 255 so we can export properly later on.
-
-The first step in creating our unsharp mask, and the task for question (1), is to create a _filter_ for our image. The _filter_ should be represented as a function, `my_filter`. `my_filter` should accept a single argument, `img`, which is a numpy ndarray, similar to `img` in our first provided snippet of code. `my_filter` should return a numpy ndarray that has been processed.
-
-[source,python]
-----
-def my_filter(img: np.ndarray) -> np.ndarray:
- """
- Given an ndarray representation of an image,
- apply a median blur to the image.
- """
- pass
-----
-
-Implement a _median_ filter that takes a target pixel, gets all of the immediate neighboring pixels, and sets the target pixel to the median value of the neighbors and itself (9 pixels total). Make sure that the pixels you are getting the median of are the _original_ pixels, not the already filtered pixels. To simplify things, completely ignore and copy over the border pixels so we don't have to worry about all of those edge cases. The sharpened image will have a 1 pixel border that is equivalent to the original.
-
-[TIP]
-====
-The following is an example of finding a median of multiple pixels.
-
-[source,python]
-----
-img = io.imread('/anvil/projects/tdm/data/images/apple.jpg')
-np.median(np.stack((img[0, 0, :], img[0, 50, :])), axis=0)
-----
-====
-
-[IMPORTANT]
-====
-. This is the most difficult question for this project; the rest should be quicker.
-. It may take a minute or so to run for images larger than our `apple.jpg` -- we aren't using any special prebuilt functions or optimizations, so it is a lot of looping.
-====
-
-[TIP]
-====
-To verify your filter is working properly, you can run the following code and make sure the hash is the same.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-img = color.rgb2lab(img)
-filtered = my_filter(img)
-filtered = color.lab2rgb(filtered)
-filtered = (filtered*255).astype('uint8')
-io.imsave("filtered.jpg", filtered)
-with open("filtered.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-m.hexdigest()
-----
-
-.output
-----
-9a5d9f62d52bcb96ea68a86dc1e3a6ae3a9715ff86476c4ccec3b11e4e7dde8e
-----
-
-To see the blur:
-
-[source,python]
-----
-imshow(filtered)
-----
-
-To see the blurred image normal scaled:
-
-[source,python]
-----
-from IPython import display
-display.Image("filtered.jpg")
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-The next step in the process is to create a _mask_. To create the mask, write a function called `create_mask` that accepts the image, a filter function (`my_filter` from question (1)), and `strength`. `create_mask` should return the mask (as an ndarray).
-
-[source,python]
-----
-def create_mask(img: np.ndarray, filt: Callable, strength: float = 0.8) -> np.ndarray:
- """
- Given the original image, a filter function,
- and a strength value. Return a mask.
- """
- pass
-----
-
-The _mask_ is simple. Take the given image, apply the filter to the image, and subtract the resulting image from the original. Take that result, and multiply by `strength`. `strength` is a value typically between .2 and 2 that affects how strongly to sharpen the image.
-
-[TIP]
-====
-Test to make sure your result is correct by running the following.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-img = color.rgb2lab(img)
-mask = create_mask(img, my_filter, 2)
-mask = color.lab2rgb(mask)
-mask = (mask*255).astype('uint8')
-io.imsave("mask.jpg", mask)
-with open("mask.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-m.hexdigest()
-----
-
-.output
-----
-e6cd9badbcb779615834e734d65730e42ded4db2030e0377d5c85ea6399d191a
-----
-
-Take a look at the mask itself! This will help you understand _what_ the mask actually is.
-
-[source,python]
-----
-imshow(mask)
-----
-
-To see the properly scaled mask:
-
-[source,python]
-----
-from IPython import display
-display.Image("mask.jpg")
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-The final step is to _apply_ your mask to the original image! Write a function called `unsharp` that accepts an image (as an ndarray, like usual) and a `strength`, and applies the algorithm!
-
-[source,python]
-----
-def unsharp(img: np.ndarray, strength: float = 0.8) -> np.ndarray:
- """
- Given the original image, and a strength value,
- return the sharpened image in numeric format.
- """
-
- def _create_mask(img: np.ndarray, filt: Callable, strength: float = 0.8) -> np.ndarray:
- """
- Given the original image, a filter function,
- and a strength value. Return a mask.
- """
- return (img - filt(img))*strength
-
-
- def _filter(img: np.ndarray) -> np.ndarray:
- """
- Given an ndarray representation of an image,
- apply a median blur to the image.
- """
-----
-
-How do you apply the full algorithm?
-
-. Create the mask using the `create_mask` function.
-. Add the result to the numeric form of the original image.
-
-That is pretty straightforward! Of course, you'll need to convert back to RGB before exporting, like normal, but it really isn't that bad!
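-
-As one hedged sketch of that composition (assuming the `create_mask` and `my_filter` functions from questions (1) and (2) are available -- this is an illustration, not the only way to structure it):
-
-[source,python]
-----
-import numpy as np
-
-def unsharp(img: np.ndarray, strength: float = 0.8) -> np.ndarray:
-    # build the mask from the original image, then add it back onto the original
-    mask = create_mask(img, my_filter, strength)
-    return img + mask
-----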
-
-[TIP]
-====
-You can verify things are working as follows.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-sharpened = color.rgb2lab(img)
-sharpened = unsharp(sharpened, 2)
-sharpened = color.lab2rgb(sharpened)
-sharpened = (sharpened*255).astype('uint8')
-io.imsave("sharpened.jpg", sharpened)
-with open("sharpened.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-m.hexdigest()
-----
-
-.output
-----
-e6cd9badbcb779615834e734d65730e42ded4db2030e0377d5c85ea6399d191a
-----
-
-You can test to see what the sharpened image looks like as follows.
-
-[source,python]
-----
-imshow(sharpened)
-----
-
-Or the normally scaled image:
-
-[source,python]
-----
-from IPython import display
-display.Image("sharpened.jpg")
-----
-====
-
-[NOTE]
-====
-There are quite a few ways you could change this algorithm to get better or slightly different results.
-====
-
-[NOTE]
-====
-There is quite a bit of magic that happens during the `color.lab2rgb` conversion.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Find another image (it could be anything) and use your function to sharpen it. Mess with the strength parameter to see how it affects things. Show at least 1 before-and-after image.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5 (optional)
-
-Instead of using the median blur effect, you could use a different filter, like a Gaussian blur. If you Google a bit, you will find that there are premade (and probably much faster) functions to perform a Gaussian blur. Use the Gaussian blur in place of the median blur, and perform the unsharp mask. Are the results better or worse in your opinion?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project07.adoc
deleted file mode 100644
index f6b088b29..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project07.adoc
+++ /dev/null
@@ -1,206 +0,0 @@
-= TDM 40100: Project 7 -- 2022
-:page-mathjax: true
-
-**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects.
-
-**Context:** In the previous project, we learned to sharpen images using unsharp masking. In this project, we will perform edge detection using Sobel filtering.
-
-**Scope:** Python, images, JAX
-
-.Learning Objectives
-****
-- Process images using `numpy`, `skimage`, and `JAX`.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/images/apple.jpg`
-- `/anvil/projects/tdm/data/images/drward.jpg`
-
-== Questions
-
-=== Question 1
-
-Let's, once again, work with our `apple.jpg` image. In the previous project, we sharpened the image using unsharp masking. In this project, we are going to try to detect edges using a Sobel filter. The first step in this process is to convert our image from color to greyscale.
-
-There are a few ways to do this; we will use the luminosity method. Create a function called `to_greyscale` that accepts the image in numeric numpy ndarray form, and returns the modified image in numeric numpy ndarray form.
-
-[NOTE]
-====
-The luminosity method of conversion takes into consideration that our eyes don't react to each color the same way. You can read about some of the other methods https://www.baeldung.com/cs/convert-rgb-to-grayscale[here].
-====
-
-$gray = \frac{(0.2989*R + 0.5870*G + 0.1140*B)}{255}$
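-
-A minimal sketch of that weighted sum using `numpy` (the function name comes from the problem statement; the slicing assumes an RGB image of shape `(rows, cols, 3)`):
-
-[source,python]
-----
-import numpy as np
-
-def to_greyscale(img: np.ndarray) -> np.ndarray:
-    """Luminosity greyscale conversion, scaled to the 0-1 range."""
-    r, g, b = img[:, :, 0], img[:, :, 1], img[:, :, 2]
-    return (0.2989 * r + 0.5870 * g + 0.1140 * b) / 255
-----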
-
-Confirm your function works.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-img = to_greyscale(img)
-io.imsave("grey.jpg", img)
-with open("grey.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-m.hexdigest()
-----
-
-.output
-----
-d3aac435526a98d5d8665c558a96b834b63e5f17531b6e197b14d3b527406970
-----
-
-To display the greyscale image using `imshow`, you must include the `cmap="gray"` option.
-
-[source,python]
-----
-imshow(img, cmap="gray")
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-The big picture with edge detection is detecting sudden changes in pixel intensities. A natural way to find changes is by using gradients/derivatives! That could be time consuming, hence the genius of the Sobel filter.
-
-Write a function called `estimate_gradients` that uses the https://en.wikipedia.org/wiki/Sobel_operator[Sobel filter] to estimate the gradients, `gx` and `gy`. `gx` is the gradient in the x direction and `gy` is the gradient in the y direction. `estimate_gradients` should accept the image and return both `gx` and `gy`.
-
-To calculate the estimated gradients, you must take a pixel and its eight neighbors, multiply them by a 3x3 "kernel", and sum the results. In a lot of ways, this is very similar to what you did manually in the previous project. However, this operation is much more popular -- so popular, it has a name -- https://en.wikipedia.org/wiki/Kernel_(image_processing)#Convolution[_convolution_].
-
-Read https://en.wikipedia.org/wiki/Sobel_operator[the Sobel operator] wikipedia page, and look at the provided kernels used to calculate the gradient estimates. Use https://jax.readthedocs.io/en/latest/_autosummary/jax.scipy.signal.convolve.html#jax-scipy-signal-convolve[this] function to calculate and return both `gx` and `gy`.
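-
-For reference, here is a hedged sketch of what that might look like. The kernels are the standard 3x3 Sobel kernels from the Wikipedia page; the `mode="same"` argument and the stand-in `img` are assumptions for illustration.
-
-[source,python]
-----
-import jax.numpy as jnp
-from jax.scipy.signal import convolve
-
-img = jnp.ones((64, 64))  # stand-in for the greyscale image from question (1)
-
-# standard Sobel kernels
-kx = jnp.array([[1, 0, -1],
-                [2, 0, -2],
-                [1, 0, -1]])
-ky = jnp.array([[ 1,  2,  1],
-                [ 0,  0,  0],
-                [-1, -2, -1]])
-
-# mode="same" keeps the output the same shape as the input image
-gx = convolve(img, kx, mode="same")
-gy = convolve(img, ky, mode="same")
-----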
-
-[TIP]
-====
-You will want your resulting image to have the same dimensions as it did _before_ applying the convolve function.
-====
-
-[TIP]
-====
-You can verify your output.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-img = to_greyscale(img)
-gx, gy = estimate_gradients(img)
-io.imsave("gx.jpg", gx)
-with open("gx.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-print(m.hexdigest())
-
-io.imsave("gy.jpg", gy)
-with open("gy.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-print(m.hexdigest())
-----
-
-.output
-----
-966c8530c02913ccc44b922ce9b42e6b85679a743b5e44757dc88ec2adfd21af
-e06ff1ed6edb589887a52d7fe154b84a12495d0ab487045e26cb0b34fc0b5402
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-What we _really_ want is a single gradient that combines both `gx` and `gy`. We can obtain it using the following formula.
-
-$G = \sqrt{gx^2 + gy^2}$
-
-Alternatively, the following would work as well.
-
-$G = |gx| + |gy|$
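-
-Either option is a one-liner in `numpy`; for instance, with hypothetical stand-in gradients:
-
-[source,python]
-----
-import numpy as np
-
-gx = np.array([[1.0, -2.0], [3.0, 0.0]])   # stand-in gradient estimates
-gy = np.array([[0.0,  1.0], [-1.0, 2.0]])
-
-G_euclidean = np.sqrt(gx**2 + gy**2)   # G = sqrt(gx^2 + gy^2)
-G_manhattan = np.abs(gx) + np.abs(gy)  # G = |gx| + |gy|
-----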
-
-Decide which formula to implement. Bring everything you've written so far together into a single function called `get_edges`. `get_edges` should accept the image (as a numeric `np.ndarray`), and return the final result: a greyscale image with edges clearly defined.
-
-You can verify your solution with the following. Note that depending on which method you chose, the resulting hash will be different. We've included both possibilities.
-
-Which method did you choose and why?
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-img = get_edges(img)
-io.imsave("edge.jpg", img)
-with open("edge.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-m.hexdigest()
-----
-
-.output options
-----
-6386859f42d9d7664b79d75f2b375058c1d0a61defb9a055caaaa69ad95504ad
-3ac023a3900013e000e40812b96f7c120edd921cc483cec2f3d0d547a6e2675b
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-The Sobel filter is very effective, but like most things, has flaws. One such flaw is the sensitivity to noise. There are some ways around that.
-
-- You could threshold the output. If G is less than a certain value, you can force the value to be 0.
-- You can apply another filter to blur the image _prior_ to calculating the gradient estimates (just like we did with the median filter in the previous project!).
-
-Create two new functions: `get_edgesv1`, and `get_edgesv2`. Version 1 should use the cutoff method and version 2 should use the blur method.
-
-This question will be graded by looking at the output images, since there are many possible variations in the results. Play around with the cutoff value in version 1. For version 2, please feel free to use our new `convolve` function to apply a _mean_ blur instead of a median blur.
-
-[TIP]
-====
-The `convolve` function makes it _super_ easy to apply a mean blur. Think about what `convolve` does and you should be able to figure out how to create a mean blur really quickly.
-====
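-
-For instance, a uniform 3x3 kernel is already a mean blur (a sketch with a stand-in image; the kernel values are the only important part):
-
-[source,python]
-----
-import jax.numpy as jnp
-from jax.scipy.signal import convolve
-
-img = jnp.ones((64, 64))            # stand-in for the greyscale image
-mean_kernel = jnp.ones((3, 3)) / 9  # every pixel in the 3x3 window weighted equally
-blurred = convolve(img, mean_kernel, mode="same")
-----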
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Apply your favorite edge detection function that you've built to a new image. How did it work? Why did you like the edge detection function you chose best? Write 1-2 sentences about your choice, and make sure to show the results of your image.
-
-Feel free to use `/anvil/projects/tdm/data/images/coke.jpg` -- the results are pretty neat!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project08.adoc
deleted file mode 100644
index b75a70f58..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project08.adoc
+++ /dev/null
@@ -1,228 +0,0 @@
-= TDM 40100: Project 8 -- 2022
-
-**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects.
-
-**Context:** In the previous project, we worked with images and implemented edge detection, with some pretty cool results! In these next couple of projects, we will continue to work with images as we learn how to compress images from scratch. This is the first in a series of 2 projects where we will implement a variation of jpeg image compression!
-
-**Scope:** Python, images, JAX
-
-.Learning Objectives
-****
-- Process images using `numpy`, `skimage`, and `JAX`.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/images/drward.jpg`
-- `/anvil/projects/tdm/data/images/barn.jpg`
-
-== Questions
-
-=== Question 1
-
-JPEG is a _lossy_ compression format and an example of transform-based compression. Lossy compression means that you can't retrieve the information that was lost during the compression process. In a nutshell, these methods use statistics to identify and discard redundant data.
-
-For this project, we will start by reading in a larger image (than our previously used apple image): `/anvil/projects/tdm/data/images/drward.jpg`.
-
-[source,python]
-----
-from skimage import io, color
-from matplotlib.pyplot import imshow
-import numpy as np
-import jax
-import jax.numpy as jnp
-import hashlib
-from IPython import display
-
-img = io.imread("/anvil/projects/tdm/data/images/drward.jpg")
-----
-
-By default, our image is read in as an RGB image, where each pixel is represented by three values between 0 and 255: the first value represents "red", the second "green", and the third "blue".
-
-In order to implement our compression algorithm, we need to change the representation of the image to the https://en.wikipedia.org/wiki/YCbCr[YCbCr] color space. Use the https://scikit-image.org/docs/stable/api/skimage.color.html[scikit-image] library we've used in previous projects to convert to the new color space. What are the dimensions now?
-
-Check out the 3-image example https://en.wikipedia.org/wiki/YCbCr[here] (the barn). Replicate this image by splitting `/anvil/projects/tdm/data/images/barn.jpg` into its YCbCr components and display them. Do the same for our `drward.jpg`.
-
-[TIP]
-====
-To display the YCbCr Y component, you will need to set the Cb and Cr components to 127. To display the Cb component, you will need to set the Cr and Y components to 127, etc. You can confirm the results by looking at your `barn.jpg` components and seeing if they look the same as the wikipedia page we linked above.
-====
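-
-A hedged sketch of displaying just the Y component (using 127 as the neutral value, per the tip above; the same pattern applies to the Cb and Cr components):
-
-[source,python]
-----
-from skimage import io, color
-from matplotlib.pyplot import imshow
-
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-ycbcr = color.rgb2ycbcr(img)
-
-y_only = ycbcr.copy()
-y_only[:, :, 1] = 127  # neutralize Cb
-y_only[:, :, 2] = 127  # neutralize Cr
-
-imshow(color.ycbcr2rgb(y_only))
-----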
-
-.Items to submit
-====
-- Display containing the 3 Y, Cb, and Cr components for both `drward.jpg` and `barn.jpg`, for a total of 6 images.
-====
-
-=== Question 2
-
-Our eyes are more sensitive to luminance than to color. As you can tell from the previous question, the Y component captures the luminance, and contains the majority of the image detail that is so important to our vision. The other Cb and Cr components are essentially just color components, and our eyes aren't as sensitive to changes in those components. Since our eyes aren't as sensitive, we don't need to capture that data as accurately, which is an opportunity to reduce what we store!
-
-Let's perform an experiment that makes this explicitly clear, as well as takes us 1 more step in the direction of having a compressed image.
-
-Downsample the Cb and Cr components and display the resulting image. There are a variety of ways to do this, but the one we will use right now is essentially to round each value to the nearest multiple of a certain factor. For instance, maybe we only want to represent values between 150 and 160 as 150 _or_ 160. So 151.111 becomes 150, and 155.555 becomes 160. This could be done as follows.
-
-[source,python]
-----
-10*np.round(img/10)
-----
-
-Or, if you wanted more granularity, you could do the following.
-
-[source,python]
-----
-2*np.round(img/2)
-----
-
-Ultimately, let's refer to this as our _factor_.
-
-[source,python]
-----
-factor*np.round(img/factor)
-----
-
-Downsample the Cb and Cr components using a factor of 10 and display the resulting image.
-
-[TIP]
-====
-Here is some maybe-useful skeleton code.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-img = color.rgb2ycbcr(img)
-# create "dimg" that contains the downsampled Cb and Cr components
-dimg = color.ycbcr2rgb(dimg)
-io.imsave("dcolor.jpg", dimg, quality=100)
-with open("dcolor.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-print(m.hexdigest())
-display.Image("dcolor.jpg")
-----
-
-"dcolor" is just a name we chose to mean downsampled color, as in, we've downsampled the color components.
-
-The hash should be the following.
-
-----
-7bf01998d636ac71553f6d82da61a784ce50d2ab9f27c67fd16243bf1634583b
-----
-====
-
-Fantastic! Can you tell a difference by just looking at the original image and the color-downsampled image?
-
-Okay, let's perform the _same_ operation, but this time, instead of downsampling the Cr and Cb components, let's downsample the Y component (and _only_ the Y component). Downsample using a factor of 10. Display the new image. Can you tell a difference by just looking at the original image and the luminance-downsampled image?
-
-[TIP]
-====
-The hash for the luminance downsampled image should be the following.
-
-----
-dff9e0688d4367d30aa46615e10701f593f1d283314c039daff95c0324a4424d
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-The previous question was pretty cool (_at least in my opinion_)! It really demonstrates how our brains are much better at perceiving changes in luminance vs. color.
-
-Downsampling is an important step in the process. In the previous question, we essentially learned that we can remove color detail by a factor of 10 and not see a difference!
-
-The next step in our compression process is to convert our image data into numeric frequency data using a discrete cosine transform. This data representation allows us to quantify what data from the image is important, and what is less important. Lower frequency components are more important; higher frequency components are less important and can essentially be considered "noise".
-
-Create a new function called `dct2` that uses https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.dct.html[scipy's dct] function, but performs the same operation over axis 0, and then over axis 1. Use `norm="ortho"`.
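-
-A minimal sketch of such a function (this matches the `dct2` that is provided in the next project in this series):
-
-[source,python]
-----
-import scipy.fftpack
-
-def dct2(x):
-    # 1-D DCT over axis 0, then over axis 1, with orthonormal scaling
-    out = scipy.fftpack.dct(x, axis=0, norm="ortho")
-    out = scipy.fftpack.dct(out, axis=1, norm="ortho")
-    return out
-----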
-
-[TIP]
-====
-Test it out to verify things are working well.
-
-[source,python]
-----
-test = np.array([[1,2,3],[3,4,5],[5,6,7]])
-dct2(test)
-----
-
-.output
-----
-array([[ 1.20000000e+01, -2.44948974e+00, 4.44089210e-16],
- [-4.89897949e+00, 0.00000000e+00, 0.00000000e+00],
- [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-For each 8x8 block of pixels in each channel (Y, Cb, Cr), apply the transformation, creating an all new array of frequency data.
-
-[TIP]
-====
-To loop through 8x8 blocks using numpy, check out the results of the following loop.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-img = color.rgb2ycbcr(img)
-s = img.shape
-for i in np.r_[:s[0]:8]:
- print(np.r_[i:(i+8)])
-----
-====
-
-[TIP]
-====
-To verify your results, you can try the following. Note that `freq` is the result of applying the `dct2` function to each 8x8 block in the image.
-
-[source,python]
-----
-dimg = color.ycbcr2rgb(freq)
-io.imsave("dctimg.jpg", dimg, quality=100)
-with open("dctimg.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-print(m.hexdigest())
-display.Image("dctimg.jpg")
-----
-
-.output
-----
-e45dc2a1a832f97bbb3f230ffaf6688d7f50307d6e43020df262314e9dd577e5
-----
-====
-
-[TIP]
-====
-Another fun (?) way to test is to apply the `dct2` function to every 8x8 block of every channel twice. The resulting image should _kind of_ look like the original. This is because the inverse function is pretty close to the function itself. We will see this in the next project.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project09.adoc
deleted file mode 100644
index e7e159685..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project09.adoc
+++ /dev/null
@@ -1,370 +0,0 @@
-= TDM 40100: Project 9 -- 2022
-
-**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects.
-
-**Context:** In the previous project, we worked with images and implemented edge detection, with some pretty cool results! In these next couple of projects, we will continue to work with images as we learn how to compress images from scratch. This is the first in a series of 2 projects where we will implement a variation of jpeg image compression!
-
-**Scope:** Python, images, JAX
-
-.Learning Objectives
-****
-- Process images using `numpy`, `skimage`, and `JAX`.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/images/drward.jpg`
-- `/anvil/projects/tdm/data/images/barn.jpg`
-
-== Questions
-
-[NOTE]
-====
-Some helpful links that were really useful.
-
-- https://en.wikipedia.org/wiki/JPEG
-- https://en.wikipedia.org/wiki/Quantization_(image_processing)
-- https://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf (if you are interested in Huffman coding)
-====
-
-=== Question 1
-
-In the previous project, we were able to isolate and display the various YCbCr components of our `barn.jpg` image. In addition, we were able to use the discrete cosine transformation to convert each of the channels of our image (Y, Cb, and Cr) to signal data.
-
-Per https://www.mathworks.com/help/images/discrete-cosine-transform.html[mathworks], the discrete cosine transform has the property that visually significant information about an image is concentrated in just a few coefficients of the resulting signal data. Meaning, if we are able to capture the majority of the visually-important data from just a few coefficients, there is a lot of opportunity to _reduce_ the amount of data we need to keep!
-
-Start from the end of the previous project. Load up some libraries.
-
-[source,python]
-----
-from skimage import io, color, viewer
-from matplotlib.pyplot import imshow
-import numpy as np
-import jax
-import jax.numpy as jnp
-import hashlib
-from IPython import display
-import scipy
-
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-----
-
-In addition, load up your `dct2` function, and create a numpy ndarray called `freq` that holds the image data (for `barn.jpg`) converted using the discrete cosine transform.
-
-Let's take a step back, and clarify a couple things.
-
-. We will not _actually_ be compressing our image, but we will be demonstrating how we can store the image's data with less space, and very little loss of image detail.
-. We will still use a simple method to estimate _about_ how much space we would save if we did compress our image.
-. We will display a "compressed" version of the image. What this means is that you will be able to view the jpeg _after_ it has lost the data it would normally lose during the compression process.
-
-Okay, begin by taking the original RGB `img` and displaying the first 8x8 block of data for each of the R, G, and B channels. Next, display the first 8x8 block of data for the Y, Cb, and Cr channels. Finally, use `dct2` to create the `freq` ndarray (like you did in the previous project). Display the first 8x8 block of the Y, Cb, and Cr channels after the transformation.
-
-[WARNING]
-====
-When we say "display 8x8 blocks" we do not mean show an image -- we mean show the numeric data in the form of a numpy array. An 8x8 numpy array printed out using `np.array_str` (see the next "important" box).
-====
-
-[IMPORTANT]
-====
-By default, numpy arrays don't print very nicely. Use `np.array_str` to "pretty" print your arrays.
-
-[source,python]
-----
-np.array_str(myarray, precision=2, suppress_small=True)
-----
-====
-
-[TIP]
-====
-To get you started, this would be how you print the R, G, and B channels first 8x8 block.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-print(np.array_str(img[:8, :8, 0], precision=2, suppress_small=True))
-print(np.array_str(img[:8, :8, 1], precision=2, suppress_small=True))
-print(np.array_str(img[:8, :8, 2], precision=2, suppress_small=True))
-----
-====
-
-[TIP]
-====
-The following are `dct2` and `idct2`.
-
-[source,python]
-----
-def dct2(x):
- out = scipy.fftpack.dct(x, axis=0, norm="ortho")
- out = scipy.fftpack.dct(out, axis=1, norm="ortho")
- return out
-----
-
-[source,python]
-----
-def idct2(x):
- out = scipy.fftpack.idct(x, axis=1, norm="ortho")
- out = scipy.fftpack.idct(out, axis=0, norm="ortho")
- return out
-----
-====
-
-[TIP]
-====
-If you did not complete the previous project, no worries, please check out question (5). This will provide you with code that lets you efficiently loop through 8x8 blocks for each channel. This is important for creating the `freq` array containing the signal data.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-
-# convert to YCbCr
-img = color.rgb2ycbcr(img)
-img = img.astype(np.int16)
-
-s = img.shape
-freq = np.zeros(s)
-
-for channel in range(3):
- for i in np.r_[:s[0]:8]:
- for j in np.r_[:s[1]:8]:
-
- # apply dct here
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The output should be 3 sets of 3 8x8 numeric numpy matrices.
-- The first matrix should be the R, G, and B channels.
-- The second matrix should be the Y, Cb, and Cr channels.
-- The third matrix should be the Y, Cb, and Cr channels after being converted to frequency data using `dct2`.
-====
-
-=== Question 2
-
-Take a close look at the final set of 8x8 blocks in the previous question -- the blocks _after_ the DCT was applied. You'll notice the top-left corner value is much different from the rest. This is the _DC coefficient_. The rest are called _AC coefficients_.
-
-We forgot an important step. _Before_ we apply the `dct2`, we need to shift our data to be centered around 0 instead of around 127. We can do this by subtracting 127 from every value _before_ applying DCT.
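-
-The shift itself is a single subtraction (a sketch; 127 is the value used in this problem statement):
-
-[source,python]
-----
-import numpy as np
-from skimage import io, color
-
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-img = color.rgb2ycbcr(img).astype(np.int16)
-
-img = img - 127  # center the data around 0 before applying dct2
-----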
-
-Re-print the first 8x8 block of `freq` after centering -- do the results look much different? According to https://en.wikipedia.org/wiki/JPEG[wikipedia], this step reduces the dynamic range requirements in the DCT processing stage.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- The output should be 1 set of 3 8x8 numeric numpy matrices.
-- The output should be very close to the third matrix from question (1), but we center the data _before_ applying dct.
-====
-
-=== Question 3
-
-Okay, great! The next step in this process is to quantize our `freq` signal data. You can read more about quantization https://en.wikipedia.org/wiki/Quantization_(image_processing)[here]. Apparently, the human brain is not very good at distinguishing changes in high frequency parts of our data, but good at distinguishing low frequency changes.
-
-We can use a quantization matrix to filter out the higher frequency data and maintain the lower frequency data. One of the more common quantization matrices is the following.
-
-[source,python]
-----
-q1 = np.array([[16,11,10,16,24,40,51,61],
- [12,12,14,19,26,28,60,55],
- [14,13,16,24,40,57,69,56],
- [14,17,22,29,51,87,80,62],
- [18,22,37,56,68,109,103,77],
- [24,35,55,64,81,104,113,92],
- [49,64,78,87,103,121,120,101],
- [72,92,95,98,112,100,103,99]])
-print(np.array_str(q1, precision=2, suppress_small=True))
-----
-
-[quote, , wikipedia]
-____
-The quantization matrix is designed to provide more resolution to more perceivable frequency components over less perceivable components (usually lower frequencies over high frequencies) in addition to transforming as many components to 0, which can be encoded with greatest efficiency.
-____
-
-Take the `freq` signal data and divide the first 8x8 block by the quantization matrix. Use `np.round` to immediately round the values to the nearest integer. Use `np.array_str` to once again display the resulting quantized 8x8 block for each of the 3 channels.
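-
-In sketch form (assuming `freq` holds the centered DCT output from question (2) and `q1` is the matrix above), quantizing the first block of one channel is just:
-
-[source,python]
-----
-import numpy as np
-
-# divide the 8x8 block by the quantization matrix, then round to integers
-quantized = np.round(freq[:8, :8, 0] / q1)
-print(np.array_str(quantized, precision=2, suppress_small=True))
-----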
-
-Wow! The results are interesting, and _this_ is where the _majority_ of the actual data loss (and compression) takes place. Let's take a minute to explain what would happen next.
-
-. The data would be encoded by first using https://en.wikipedia.org/wiki/Run-length_encoding[run-length encoding]
-. Then, the data would be encoded by using https://en.wikipedia.org/wiki/Huffman_coding[Huffman coding].
-+
-[NOTE]
-====
-The details are beyond this course; however, it is not _too_ inaccurate to say that the zeros essentially don't need to be stored anymore. So for our first 8x8 block, we went from needing to store about 64 values per channel to only 1, for a total reduction from 192 values to 3.
-====
-+
-. The encoded data, and all of the information (huffman tables, quantization tables, etc.) needed to _reverse_ the process and _restore_ the image, would be structured carefully and stored as a jpeg file.
-
-Then, when someone goes to _open_ the image, the jpeg file contains all of the information needed to _reverse_ the process and the image is displayed!
-
-You may be wondering -- wait, you are saying we can take those super sparse matrices we just printed and get back to our original RGB values? Nope! But we can recover the "important stuff" that creates an image that looks visually identical to our original image! This would be, in effect, the same image we would see if we implemented the entire algorithm and displayed the resulting image!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Output should be 1 set of 3 8x8 matrices that apply the quantization matrix and rounding after dct.
-====
-
-=== Question 4
-
-Use the following `idct2` function (the inverse of `dct2`) and print out the first 8x8 block for each channel _after_ the process has been reversed. Starting with the quantized `freq` data from the previous question, the inverse process would be the following.
-
-. Multiply by the quantization table.
-. Use the `idct2` function to reverse the dct.
-. Add 127 to the final result to undo the shift highlighted in question (2).
-
-Use `np.array_str` to print the first 8x8 block for each channel. Do the results look fairly close to the original YCbCr channel values? Impressive!
-
-[TIP]
-====
-[source,python]
-----
-def idct2(x):
- out = scipy.fftpack.idct(x, axis=1, norm="ortho")
- out = scipy.fftpack.idct(out, axis=0, norm="ortho")
- return out
-----
-====
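-
-Putting the three inverse steps together for a single 8x8 block might look roughly like the following sketch (`quantized` and `q1` are stand-ins here -- in the project they come from question (3)):
-
-[source,python]
-----
-import numpy as np
-import scipy.fftpack
-
-def idct2(x):
-    out = scipy.fftpack.idct(x, axis=1, norm="ortho")
-    out = scipy.fftpack.idct(out, axis=0, norm="ortho")
-    return out
-
-quantized = np.zeros((8, 8))  # stand-in for a quantized block from question (3)
-q1 = np.ones((8, 8)) * 16     # stand-in for the quantization matrix
-
-# 1. de-quantize, 2. inverse dct, 3. de-shift
-restored = idct2(quantized * q1) + 127
-----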
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Output should be 1 set of 3 8x8 matrices that apply the quantization matrix, round, de-apply quantization matrix, perform inverse dct, and de-shift the values. These matrices should be _nearly_ the same as the original YCbCr values from question (1).
-====
-
-=== Question 5
-
-Let's put it all together! While we aren't fully implementing the compression algorithm, we _do_ implement the parts that cause loss (hence jpeg is a _lossy_ algorithm). Since we implement those parts, we should also be able to view the lossy version of the image to see if we can perceive a difference! In addition, we could also count the number of non-zero values in our image data _before_ we process anything, and re-count immediately after the quantization and rounding, where many zeros appear in our matrices. This will _quite roughly_ tell us the savings if we were to implement the entire algorithm!
-
-[TIP]
-====
-You can use https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html#numpy.count_nonzero[np.count_nonzero] to count the non-zero values of an array.
-====
-
-For our `barn.jpg` image, walk through the entire algorithm (excluding the encoding parts). Reverse the process after quantization and rounding, all the way back to saving and displaying the lossy image. Since this has been a bit of a roller coaster project, we will provide some skeleton code for you to complete.
-
-[source,python]
-----
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-
-# TODO: count the nonzero values before anything
-original_nonzero =
-
-q1 = np.array([[16,11,10,16,24,40,51,61],
- [12,12,14,19,26,28,60,55],
- [14,13,16,24,40,57,69,56],
- [14,17,22,29,51,87,80,62],
- [18,22,37,56,68,109,103,77],
- [24,35,55,64,81,104,113,92],
- [49,64,78,87,103,121,120,101],
- [72,92,95,98,112,100,103,99]]).astype(np.int16)
-
-# convert to YCbCr
-img = color.rgb2ycbcr(img)
-img = img.astype(np.int16)
-
-# TODO: shift values to center around 0, for each channel
-
-s = img.shape
-freq = np.zeros(s)
-
-# downsample <- from previous project
-img[:,:,1] = 2*np.round(img[:,:,1]/2)
-img[:,:,2] = 2*np.round(img[:,:,2]/2)
-
-# variable to store number of non-zero values
-nonzero = 0
-
-for channel in range(3):
- for i in np.r_[:s[0]:8]:
- for j in np.r_[:s[1]:8]:
-
- # Example: printing a 8x8 block
- # Note: this can (and should) be deleted
- print(freq[i:(i+8), j:(j+8), channel])
-
- # TODO: apply dct to current 8x8 block
-
-
- # TODO: apply quantization to current 8x8 block
-
-
- # TODO: round values of the current 8x8 block
-
-
- # TODO: increment our count of non-zero values
-
-
- # TODO: de-quantize the current 8x8 block
-
-
- # TODO: apply inverse dct to current 8x8 block
-
-
-
-# TODO: de-shift the values that were previously shifted, for each channel
-
-# convert back to RGB
-img = color.ycbcr2rgb(freq)
-
-# print the number of nonzero values immediately post-quantization
-print(f"Non-zero values: {nonzero}")
-
-# print the _very_ approximate reduction of space for this image
-print(f"Reduction: {nonzero/original_nonzero}")
-
-# multiply image by 255 to rescale values to be between 0 and 255 instead of 0 and 1
-img = img*255
-
-# TODO: clip values greater than 255 and set those values equal to 255
-
-# TODO: clip values less than 0 and set those values equal to 0
-
-# save the "compressed" image so we can display it
-# NOTE: The file won't _actually_ be compressed, but it will be visually identical to a compressed image
-# since the lossy parts of the algorithm (the parts of the algorithm where we lose "unimportant" pieces of data)
-# have already taken place.
-io.imsave("compressed.jpg", img, quality=100)
-with open("compressed.jpg", "rb") as f:
- my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-print(m.hexdigest())
-display.Image("compressed.jpg")
-----
-
-[source,python]
-----
-# display the original image, for comparison
-display.Image("/anvil/projects/tdm/data/images/barn.jpg")
-----
-
-[TIP]
-====
-The hash I got was the following.
-
-.hash
-----
-bc004579948c5b699b0df52eb69ce168147481a2430d828939cfa791f59783e7
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project10.adoc
deleted file mode 100644
index c4ec25ed3..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project10.adoc
+++ /dev/null
@@ -1,199 +0,0 @@
-= TDM 40100: Project 10 -- 2022
-
-**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well.
-
-**Context:** This is the first in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we've touched on in previous Data Mine courses. For this first project, we will start slow with a `selenium` review and a small scraping challenge.
-
-**Scope:** selenium, Python, web scraping
-
-.Learning Objectives
-****
-- Use selenium to interact with a web page prior to scraping.
-- Use selenium and xpath expressions to efficiently scrape targeted data.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-The following code provides you with both a template for configuring a Firefox selenium driver that will work on Anvil, and a straightforward example that demonstrates how to search web pages and elements using xpath expressions and how to emulate a keyboard. Take a moment, run the code, and try to jog your memory.
-
-[source,python]
-----
-import time
-from selenium import webdriver
-from selenium.webdriver.firefox.options import Options
-from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
-from selenium.webdriver.common.keys import Keys
-----
-
-[source,python]
-----
-firefox_options = Options()
-firefox_options.add_argument("--window-size=810,1080")
-# Headless mode means no GUI
-firefox_options.add_argument("--headless")
-firefox_options.add_argument("--disable-extensions")
-firefox_options.add_argument("--no-sandbox")
-firefox_options.add_argument("--disable-dev-shm-usage")
-
-driver = webdriver.Firefox(options=firefox_options)
-----
-
-[source,python]
-----
-# navigate to the webpage
-driver.get("https://purdue.edu/directory")
-
-# full page source
-print(driver.page_source)
-
-# get html element
-e = driver.find_element("xpath", "//html")
-
-# print html element
-print(e.get_attribute("outerHTML"))
-
-# isolate the search bar "input" element
-# important note: the following actually searches the entire DOM, not just the element e
-inp = e.find_element("xpath", "//input")
-
-# to start with the element e and _not_ search the entire DOM, you'd do the following
-inp = e.find_element("xpath", ".//input")
-print(inp.get_attribute("outerHTML"))
-
-# use "send_keys" to type in the search bar
-inp.send_keys("mdw")
-
-# just like when you use a browser, you either need to push "enter" or click on the search button. This time, we will press enter.
-inp.send_keys(Keys.RETURN)
-
-# We can delay the program to allow the page to load
-time.sleep(5)
-
-# get the table
-table = driver.find_element("xpath", "//table[@class='more']")
-
-# print the table content
-print(table.get_attribute("outerHTML"))
-----
-
-Use `selenium` to isolate and print out Dr. Ward's: alias, email, campus, department, and title.
-
-[TIP]
-====
-The `following-sibling` axis may be useful here -- see: https://stackoverflow.com/questions/11657223/xpath-get-following-sibling.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Use `selenium` and its `click` method to first click the "VIEW MORE" link and then scrape and print: other phone, building, office, qualified name, and url.
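-
-As a hedged sketch of the general pattern (the xpath here is purely illustrative -- inspect the page source to find the actual element and text):
-
-[source,python]
-----
-# hypothetical sketch -- "driver" is the selenium driver from question (1),
-# and the tag/link text in the xpath are assumptions
-view_more = driver.find_element("xpath", "//a[contains(text(), 'VIEW MORE')]")
-view_more.click()
-time.sleep(2)  # give the page a moment to update before scraping
-----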
-
-Take a look at the page source -- do you think clicking "VIEW MORE" was needed in order to scrape that data? Why or why not?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Okay, finally, we are building some tools to help us analyze the housing market. https://zillow.com has extremely rich data on homes for sale, for rent, and lots of land.
-
-Click around and explore the website a little bit. Note the following.
-
-. Homes are typically listed on the right-hand side of the web page in a 21x2 set of "cards", for a total of 40 homes.
-+
-[NOTE]
-====
-At least in my experimentation -- the last row only held 1 card and there was 1 advertisement card, which I consider spam.
-====
-. If you want to search for homes for sale, you can use the following link: `https://www.zillow.com/homes/for_sale/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: https://www.zillow.com/homes/for_sale/lafayette-in_rb.
-. If you want to search for homes for rent, you can use the following link: `https://www.zillow.com/homes/for_rent/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: https://www.zillow.com/for_rent/lafayette-in_rb.
-. If you load, for example, https://www.zillow.com/homes/for_rent/lafayette-in_rb and rapidly scroll down the right side of the screen where the "cards" are shown, it will take a fraction of a second for some of the cards to load. In fact, unless you scroll, those cards will not load and if you were to parse the page contents, you would not find all 40 cards are loaded. This general strategy of loading content as the user scrolls is called lazy loading.
-
-Write a function called `get_links` that, given a `search_term`, will return a list of property links for the given `search_term`. The function should both get all of the cards on a page and cycle through all of the pages of homes for the query.
-
-[TIP]
-====
-The following was a good query that had only 2 pages of results.
-
-[source,python]
-----
-my_links = get_links("47933")
-----
-====
-
-[TIP]
-====
-You _may_ want to include an internal helper function called `_load_cards` that accepts the driver and scrolls through the page slowly in order to load all of the cards.
-
-https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python[This] link will help! Conceptually, here is what we did.
-
-. Get initial set of cards using xpath expressions.
-. Use `driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])` to scroll to the last card that was found in the DOM.
-. Find cards again (now that more may have loaded after scrolling).
-. If no more cards were loaded, exit.
-. Update the number of cards we've loaded and repeat.
-====
-
-[TIP]
-====
-Sleep 2 seconds using `time.sleep(2)` between every scroll or link click.
-====
-
-[TIP]
-====
-After getting the links for each page, use `driver.delete_all_cookies()` to clear off cookies and help avoid captcha.
-====
-
-[TIP]
-====
-Instead of navigating directly to the link from the "next page" button to get the next page, use `next_page.click()` to click on it. Otherwise, you may get a captcha.
-====
-
-[TIP]
-====
-Use something like:
-
-[source,python]
-----
-with driver as d:
- d.get(blah)
-----
-
-This way, after exiting the `with` scope, the driver will be properly closed and quit, which will decrease the likelihood of you getting captchas.
-====
-
-[TIP]
-====
-For our solution, we had a `while True:` loop in the `_load_cards` function and in the `get_links` function and used the `break` command in an if statement to exit.
-====
-
-[TIP]
-====
-Need more help? Post in Piazza and I will help get you unstuck and give more hints.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project11.adoc
deleted file mode 100644
index 815ceef4e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project11.adoc
+++ /dev/null
@@ -1,452 +0,0 @@
-= TDM 40100: Project 11 -- 2022
-
-**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well.
-
-**Context:** This is the second in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we've touched on in previous Data Mine courses. For this second project, we continue to build our suite of tools designed to scrape public housing data.
-
-**Scope:** selenium, Python, web scraping
-
-.Learning Objectives
-****
-- Use selenium to interact with a web page prior to scraping.
-- Use selenium and xpath expressions to efficiently scrape targeted data.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-If you did not complete the previous project, the code for the `get_links` function is provided below (made available on Monday, November 14th).
-
-[source,python]
-----
-import time
-
-from selenium import webdriver
-from selenium.webdriver.firefox.options import Options
-from selenium.common.exceptions import StaleElementReferenceException
-
-def get_links(search_term: str) -> list[str]:
- """
- Given a search term, return a list of web links for all of the resulting properties.
- """
- def _load_cards(driver):
- """
- Given the driver, scroll through the cards
- so that they all load.
- """
- cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
- while True:
- try:
- num_cards = len(cards)
- driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
- time.sleep(2)
- cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
- if num_cards == len(cards):
- break
- num_cards = len(cards)
- except StaleElementReferenceException:
- # every once in a while we will get a StaleElementReferenceException
- # because we are trying to access or scroll to an element that has changed.
- # this probably means we can skip it because the data has already loaded.
- continue
-
- links = []
- url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"
-
- firefox_options = Options()
- # Headless mode means no GUI
- firefox_options.add_argument("--headless")
- firefox_options.add_argument("--disable-extensions")
- firefox_options.add_argument("--no-sandbox")
- firefox_options.add_argument("--disable-dev-shm-usage")
- driver = webdriver.Firefox(options=firefox_options)
-
- with driver as d:
- d.get(url)
- d.delete_all_cookies()
- while True:
- time.sleep(2)
- _load_cards(d)
- links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")])
- next_link = d.find_element("xpath", "//a[@rel='next']")
- if next_link.get_attribute("disabled") == "true":
- break
- url = next_link.get_attribute('href')
- d.delete_all_cookies()
- next_link.click()
-
- return links
-----
-
-There is a _lot_ of rich data on a home's page. If you want to gauge the housing market in an area or for a `search_term`, there are two pieces of data that could be particularly useful: the "Price history" and "Public tax history" components of the page.
-
-Check out https://zillow.com links for a couple different houses.
-
-Let's say you want to track the `date`, `event`, and `price` in a `price_history` table, and the `year`, `property_tax`, and `tax_assessment` in a `tax_history` table.
-
-Write 2 `CREATE TABLE` statements to create the `price_history` and `tax_history` tables. In addition, create a `houses` table where the `NUMBER_zpid` is the primary key, along with an `html` column, which will store an HTML file. You can find the id in a house's link. For example, https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/ has the id `43641432_zpid`.
-
-Use `sqlite3` to create the tables in a database called `$HOME/houses.db`. You can do all of this from within Jupyter Lab.
-
-[source,ipython]
-----
-%sql sqlite:///$HOME/houses.db
-----
-
-[source,ipython]
-----
-%%sql
-
-CREATE TABLE ...
-----
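-
-As one hypothetical example of what the `houses` table could look like (the column names and types beyond what the question states are assumptions), the same thing can also be done through the `sqlite3` package directly:
-
-[source,python]
-----
-import os
-import sqlite3
-
-con = sqlite3.connect(os.path.expanduser("~/houses.db"))
-cur = con.cursor()
-
-# hypothetical schema: the zpid string from the link as the primary key,
-# and a BLOB column holding the raw HTML file contents
-cur.execute("""
-CREATE TABLE IF NOT EXISTS houses (
-    zpid TEXT PRIMARY KEY,
-    html BLOB
-);
-""")
-con.commit()
-con.close()
-----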
-
-Run the following queries to confirm and show your table schemas.
-
-[source, sql]
-----
-PRAGMA table_info(houses);
-----
-
-[source, sql]
-----
-PRAGMA table_info(price_history);
-----
-
-[source, sql]
-----
-PRAGMA table_info(tax_history);
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Write a function called `link_to_blob` that takes a `link` and returns a `blob` of the HTML file.
-
-. Navigate to page.
-. Sleep 2 seconds.
-. Scroll so elements load up. (think "Price and tax history" and clicking "See complete tax history", and clicking "See complete price history", etc.)
-. Create a `.html` file and `write` the driver's `page_source` to the file.
-. Open the file in `rb` mode and use the `read` method to read the file into binary format. Return the binary format object.
-. Delete the `.html` file from step (4).
-. Quit the driver by calling `driver.quit()`.
-
-In addition, write a function called `blob_to_html` that accepts a blob (like what is returned from `link_to_blob`) and returns the string containing the HTML content.
-
-Demonstrate the functions by using `link_to_blob` to get the blob for a link, and then using `blob_to_html` to get the HTML content back from the returned value of `link_to_blob`.
-
-[IMPORTANT]
-====
-Just print the first 500 characters of the results of `blob_to_html` to avoid cluttering your output.
-====
-
-[NOTE]
-====
-If you are unsure how to do any of this -- please feel free to post in Piazza!
-====
-
-[TIP]
-====
-Here is some skeleton code. The structure provided here works well for the problem.
-
-[source,python]
-----
-import uuid
-import os
-
-def link_to_blob(link: str) -> bytes:
- def _load_tables(driver):
- """
- Given the driver, scroll through the cards
- so that they all load.
- """
- # find price and tax history element using xpath
- table = driver.find_element(...)
-
- # scroll the table into view
- driver.execute_script(...)
-
- # sleep 2 seconds
- time.sleep(2)
-
- try:
- # find the "See complete tax history" button (if it exists)
- see_more = driver.find_element(...)
-
- # click the button to reveal the rest of the history (if it exists)
- see_more.click()
-
- except NoSuchElementException:
- pass
-
- try:
- # find the "See complete price history" button (if it exists)
- see_more = driver.find_element(...)
-
- # click the button to reveal the rest of the history (if it exists)
- see_more.click()
-
- except NoSuchElementException:
- pass
-
- # create a .html file with a random name using the uuid package so there aren't collisions
- filename = f"{uuid.uuid4()}.html"
-
- # open the file
- with open(filename, 'w') as f:
- firefox_options = Options()
- # Headless mode means no GUI
- firefox_options.add_argument("--headless")
- firefox_options.add_argument("--disable-extensions")
- firefox_options.add_argument("--no-sandbox")
- firefox_options.add_argument("--disable-dev-shm-usage")
-
- driver = webdriver.Firefox(options=firefox_options)
- driver.get(link)
- time.sleep(2)
- _load_tables(driver)
-
- # write the page source to the file
- f.write(...)
- driver.quit()
-
- # open the file in read binary mode
- with open(filename, 'rb') as f:
- # read the binary contents that are ready to be inserted into a sqlite BLOB
- blob = f.read()
-
- # remove the file from the filesystem -- we don't need it anymore
- os.remove(filename)
-
- return blob
-----
-====
-
-[TIP]
-====
-Use this trick: https://the-examples-book.com/starter-guides/data-formats/xml#write-an-xpath-expression-to-get-every-div-element-where-the-string-abc123-is-in-the-class-attributes-value-as-a-substring for finding and clicking the "see more" buttons for the two tables. If you dig into the HTML you'll see there is some text you can use to jump right to the two tables.
-
-To add to this, if instead of `@class, 'abc'` you use `text(), 'abc'` it will try to match the values between elements to "abc". For example, `//div[contains(text(), 'abc')]` will match `<div>abc</div>`.
-====
-
-[TIP]
-====
-Remember the goal of this problem is to click the "see more" buttons (if they exist on a given page), and then just save the whole HTML page and convert it to binary for storage.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Write functions that accept html content (as a string) and uses the `lxml.html` package to parse the HTML content and extract the various components for our `price_history` and `tax_history` tables.
-
-[TIP]
-====
-My functions returned a list of lists, since the `sqlite3` python package will accept that format in an `executemany` statement.
-====
-
-[TIP]
-====
-[source,python]
-----
-import lxml.html
-
-tree = lxml.html.fromstring(blob_to_html(my_blob))
-tree.xpath("blah")
-----
-====
-
-[TIP]
-====
-Here is some example output from my functions -- you do not need to match this if you have a better way to do it.
-
-[source,python]
-----
-my_blob = link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/")
-get_price_history(blob_to_html(my_blob))
-----
-
-Where
-
-[source,python]
-----
-def blob_to_html(blob: bytes) -> str:
- return blob.decode("utf-8")
-----
-
-.output
-----
-[['11/9/2022', 'Price change', 275000],
- ['11/2/2022', 'Listed for sale', 289900],
- ['1/13/2000', 'Sold', 19000]]
-----
-
-[source,python]
-----
-my_blob = link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/")
-get_tax_history(blob_to_html(my_blob))
-----
-
-.output
-----
-[[2021, 1344, 124511],
- [2020, 1310, 122792],
- [2019, 1290, 120031],
- [2018, 1260, 117793],
- [2017, 1260, 115370],
- [2016, 1252, 112997],
- [2015, 1262, 112212],
- [2014, 1277, 113120],
- [2013, 1295, 112920],
- [2012, 1389, 124535],
- [2011, 1557, 134234],
- [2010, 1495, 132251],
- [2009, 1499, 128776],
- [2008, 1483, 128647],
- [2007, 1594, 124900],
- [2006, 1608, 121900],
- [2005, 1704, 118400],
- [2004, 1716, 115000],
- [2003, 1624, 112900],
- [2002, 1577, 110300],
- [2000, 288, 15700]]
-----
-====
-
-[TIP]
-====
-Some skeleton hints if you want extra help. See discussion: https://piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark
-
-[source,python]
-----
-def get_price_history(html: str):
- tree = lxml.html.fromstring(html)
- # xpath to find the price and tax history table
- # then, you can use the xpath `following-sibling::div` to find the `div` that directly follows the
- # price and tax history div (hint, look for "Price-and-tax-history" in the id attribute of a div element
- # after the "following-sibling::div" part, look for <tr> elements with an id attribute
- trs = tree.xpath(...)
- values = []
- for tr in trs:
- # xpath on the "tr" to find td with an inner span. Use string methods to remove the $ and remove the ",", and to remove trailing whitespace
- price = tr.xpath(...)[2].text.replace(...).replace(...).strip()
-
- # if price is empty, make it None
- if price == '':
- price = None
-
- # append the values
- values.append([tr.xpath(...)[0].text, tr.xpath(...)[1].text, price])
-
- return values
-----
-====
-
-[TIP]
-====
-More skeleton code help, if wanted. See discussion: https://piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark
-
-[source,python]
-----
-def get_tax_history(html: str):
- tree = lxml.html.fromstring(html)
- try:
- # find the 'Price-and-tax-history' div, then, the following-sibling::div, then a table element, then a tbody element
- tbody = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//table//tbody")[1]
- except IndexError:
- return None
- values = []
- # get the trs in the tbody
- for tr in tbody.xpath(".//tr"):
- # replace the $, ",", and "-", strip whitespace
- prop_tax = tr.xpath(...)[1].text.replace(...).replace(...).replace(...).strip()
- # if prop_tax is empty set to None
- if prop_tax == '':
- prop_tax = None
- # add the data, for the last item in the list, remove $ and ","
- values.append([int(tr.xpath(...)[0].text), prop_tax, int(tr.xpath(...)[2].text.replace(...).replace(...))])
-
- return values
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Write code that uses the `get_links` function to get a list of links for a `search_term`. Process each link in the list and insert the retrieved data into your `houses.db` database.
-
-Once complete, run a couple queries that demonstrate that the data was successfully inserted into the database.
-
-[TIP]
-====
-Here is some skeleton code to assist.
-
-[source,python]
-----
-import sqlite3
-from urllib.parse import urlsplit
-from tqdm.notebook import tqdm
-
-links = get_links("47933")
-
-# connect to database
-con = sqlite3.connect(...)
-for link in tqdm(links): # this shows a progress bar for assistance
-
- # use link_to_blob to get the blob
-
- # use urlsplit to extract the zpid from the link
-
- # add values to a tuple for insertion into the database
- to_insert = (linkid, blob)
-
- # get a cursor
- cur = con.cursor()
-
- # insert the data into the houses table using the cursor
-
- # get price history data to insert
- to_insert = get_price_history(blob_to_html(blob))
-
- # insert id into price history data
- for val in to_insert:
- val.insert(0, linkid)
-
- # insert the data into the price_history table using the cursor
-
- # prep the tax history data in the exact same way as price history
-
- # if there is tax history data, insert the ids just like before
-
- # insert the data into the tax_history table using the cursor
-
- # commit the changes
- con.commit()
-
-# close the connection
-con.close()
-----
-====
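-
-[TIP]
-====
-For the queries that demonstrate the data was inserted, something simple is enough. The following is just a sketch; adjust the table names to match the schema you created.
-
-[source,python]
-----
-import sqlite3
-
-con = sqlite3.connect("houses.db")
-cur = con.cursor()
-
-# how many houses were stored?
-print(cur.execute("SELECT COUNT(*) FROM houses").fetchone())
-
-# a small sample of the price history rows
-for row in cur.execute("SELECT * FROM price_history LIMIT 5"):
-    print(row)
-
-con.close()
-----
-====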
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project12.adoc
deleted file mode 100644
index c1f1ffbc6..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project12.adoc
+++ /dev/null
@@ -1,452 +0,0 @@
-= TDM 40100: Project 12 -- 2022
-
-**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially some `sqlite3`, containerization, and analysis work as well.
-
-**Context:** This is the third in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we've touched on in previous Data Mine courses. For this third project, we continue to build our suite of tools designed to scrape public housing data.
-
-**Scope:** playwright, Python, web scraping
-
-.Learning Objectives
-****
-- Use playwright to interact with a web page prior to scraping.
-- Use playwright and xpath expressions to efficiently scrape targeted data.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-This has been (maybe) a bit intense for a project series. This project is going to give you a little break and not give you _anything_ new to do, except changing the package we are using.
-
-`playwright` is a modern web scraping tool backed by Microsoft that, like `selenium`, allows you to interact with a web page before scraping. `playwright` is not necessarily better (yet); however, it _is_ different and actively maintained.
-
-Implement the `get_links` and `link_to_blob` functions using `playwright` instead of `selenium`. You can find the documentation for `playwright` https://playwright.dev/python/docs/intro[here].
-
-Before you get started, you will need to run the following in a `bash` cell.
-
-[source,ipython]
-----
-%%bash
-
-python3 -m playwright install
-----
-
-Finally, we aren't going to force you to fight with the `playwright` documentation to get started, so the following is an example of code that will run in a Jupyter notebook _and_ performs many of the same basic operations you are accustomed to from `selenium`.
-
-[source,python]
-----
-import time
-import asyncio
-from playwright.async_api import async_playwright
-
-# so we can run this from within Jupyter, which is already async
-import nest_asyncio
-nest_asyncio.apply()
-
-async def main():
- async with async_playwright() as p:
- browser = await p.firefox.launch(headless=True)
- context = await browser.new_context()
- page = await context.new_page()
- await page.goto("https://purdue.edu/directory")
-
- # print the page source
- print(await page.content())
-
- # get html element
- e = page.locator("xpath=//html")
-
- # print the inner html of the element
- print(await e.inner_html())
-
- # isolate the search bar "input" element
- inp = e.locator("xpath=.//input")
-
- # print the outer html, or the element and contents
- print(await inp.evaluate("el => el.outerHTML"))
-
- # fill the input with "mdw"
- await inp.fill("mdw")
- print(await inp.evaluate("el => el.outerHTML"))
-
- # find the search button and click it
- await page.locator("xpath=//a[@id='glass']").click()
-
- # We can delay the program to allow the page to load
- time.sleep(5)
-
- # find the table in the page with Dr. Ward's content
- table = page.locator("xpath=//table[@class='more']")
-
- # print the table and contents
- print(await table.evaluate("el => el.outerHTML"))
-
- # find the alias, if a selector starts with // or .. it is assumed to be xpath
- print(await page.locator("//th[@class='icon-key']").evaluate("el => el.outerHTML"))
-
- # you can print an attribute
- print(await page.locator("//th[@class='icon-key']").get_attribute("scope"))
-
- # similarly, you can print an element's content
- print(await page.locator("//th[@class='icon-key']").inner_text())
-
- # you could use the regular xpath stuff, no problem
- print(await page.locator("//th[@class='icon-key']/following-sibling::td").inner_text())
-
- await browser.close()
-
-asyncio.run(main())
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Implement the `get_links` function using `playwright`. Test it out so that the example below is the same (or close; listed houses may change).
-
-[TIP]
-====
-Here is the `selenium` version.
-
-[source,python]
-----
-def get_links(search_term: str) -> list[str]:
- """
- Given a search term, return a list of web links for all of the resulting properties.
- """
- def _load_cards(driver):
- """
- Given the driver, scroll through the cards
- so that they all load.
- """
- cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
- while True:
- try:
- num_cards = len(cards)
- driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
- time.sleep(2)
- cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
- if num_cards == len(cards):
- break
- num_cards = len(cards)
- except StaleElementReferenceException:
- # every once in a while we will get a StaleElementReferenceException
- # because we are trying to access or scroll to an element that has changed.
- # this probably means we can skip it because the data has already loaded.
- continue
-
- links = []
- url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"
-
- firefox_options = Options()
- # Headless mode means no GUI
- firefox_options.add_argument("--headless")
- firefox_options.add_argument("--disable-extensions")
- firefox_options.add_argument("--no-sandbox")
- firefox_options.add_argument("--disable-dev-shm-usage")
- driver = webdriver.Firefox(options=firefox_options)
-
- with driver as d:
- d.get(url)
- d.delete_all_cookies()
- while True:
- time.sleep(2)
- _load_cards(d)
- links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")])
- next_link = d.find_element("xpath", "//a[@rel='next']")
- if next_link.get_attribute("disabled") == "true":
- break
- url = next_link.get_attribute('href')
- d.delete_all_cookies()
- next_link.click()
-
- return links
-----
-====
-
-[TIP]
-====
-Use the `set_viewport_size` function to change the browser's width to 960 and height to 1080.
-====
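-
-[TIP]
-====
-A minimal sketch of that call (inside your `async` function, using the dict-based argument the async API expects):
-
-[source,python]
-----
-# set the browser viewport to 960x1080 before loading the page
-await page.set_viewport_size({"width": 960, "height": 1080})
-----
-====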
-
-[TIP]
-====
-Don't forget to `await` the async functions -- this is going to be the most likely source of errors.
-====
-
-[TIP]
-====
-Unlike in `selenium`, in `playwright`, you won't be able to do something like this:
-
-[source,python]
-----
-# wrong
-cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
-len(cards) # get the number of cards found
-----
-
-Instead, you'll have to use the useful https://playwright.dev/docs/api/class-locator#locator-count[`count`] function to get the number of cards found.
-
-[source,python]
-----
-cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
-num_cards = await cards.count()
-----
-====
-
-[TIP]
-====
-Unlike in `selenium`, in `playwright`, you won't be able to do something like this:
-
-[source,python]
-----
-# wrong
-cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
-await cards[num_cards-1].scroll_into_view_if_needed()
-----
-
-Instead, you'll have to use the useful https://playwright.dev/docs/api/class-locator#locator-nth[`nth`] function to get the nth element in the list of cards.
-
-[source,python]
-----
-cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
-await cards.nth(num_cards-1).scroll_into_view_if_needed()
-----
-====
-
-[TIP]
-====
-To clear cookies, search for "cookie" in the playwright documentation. Hint: you can clear cookies using the context object.
-====
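-
-[TIP]
-====
-If you want to sanity check what you find in the documentation, the context-based approach looks roughly like the sketch below (assuming `context` is the `BrowserContext` returned by `browser.new_context()`).
-
-[source,python]
-----
-# remove all cookies from the browsing context
-await context.clear_cookies()
-----
-====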
-
-[TIP]
-====
-The following provides a working skeleton for running the asynchronous code in Jupyter.
-
-[source,python]
-----
-import time
-import asyncio
-from playwright.async_api import async_playwright, expect
-
-import nest_asyncio
-nest_asyncio.apply()
-
-async def get_links(search_term: str) -> list[str]:
- """
- Given a search term, return a list of web links for all of the resulting properties.
- """
- async def _load_cards(page):
- """
- Given the driver, scroll through the cards
- so that they all load.
- """
- pass
-
- links = []
- url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"
- async with async_playwright() as p:
- browser = await p.firefox.launch(headless=True)
- context = await browser.new_context()
- page = await context.new_page()
-
- # code
-
- time.sleep(10) # useful if using headful mode (not headless)
- await browser.close()
-
- return links
-
-my_links = asyncio.run(get_links("47933"))
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Implement the `link_to_blob` function using `playwright`. Test it out so the example below functions.
-
-[TIP]
-====
-The `selenium` version will be posted below on Monday, November 21.
-
-[source,python]
-----
-import uuid
-import os
-
-def link_to_blob(link: str) -> bytes:
- def _load_tables(driver):
- """
- Given the driver, scroll through the cards
- so that they all load.
- """
- table = driver.find_element("xpath", "//div[@id='Price-and-tax-history']")
- driver.execute_script('arguments[0].scrollIntoView();', table)
- time.sleep(2)
- try:
- see_more = driver.find_element("xpath", "//span[contains(text(), 'See complete tax history')]")
- see_more.click()
- except NoSuchElementException:
- pass
- try:
- see_more = driver.find_element("xpath", "//span[contains(text(), 'See complete price history')]")
- see_more.click()
- except NoSuchElementException:
- pass
-
- filename = f"{uuid.uuid4()}.html"
- with open(filename, 'w') as f:
- firefox_options = Options()
- # Headless mode means no GUI
- firefox_options.add_argument("--headless")
- firefox_options.add_argument("--disable-extensions")
- firefox_options.add_argument("--no-sandbox")
- firefox_options.add_argument("--disable-dev-shm-usage")
-
- driver = webdriver.Firefox(options=firefox_options)
- driver.get(link)
- time.sleep(2)
- _load_tables(driver)
- f.write(driver.page_source)
- driver.quit()
-
- with open(filename, 'rb') as f:
- blob = f.read()
-
- os.remove(filename)
-
- return blob
-----
-====
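-
-[TIP]
-====
-When porting this to `playwright`, the rough equivalent of `driver.page_source` is `page.content()` (it appears in the Question 1 example). A minimal sketch of the grab-and-encode portion, under that assumption:
-
-[source,python]
-----
-# grab the full HTML of the current page and encode it for storage
-html = await page.content()
-blob = html.encode("utf-8")
-----
-====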
-
-[TIP]
-====
-The `get_price_history` and `get_tax_history` solutions will be posted below on Monday, November 21.
-
-[source,python]
-----
-import lxml.html
-
-def get_price_history(html: str):
- tree = lxml.html.fromstring(html)
- trs = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//tr[@id]")
- values = []
- for tr in trs:
- price = tr.xpath(".//td/span")[2].text.replace("$", "").replace(",", "").strip()
- if price == '':
- price = None
- values.append([tr.xpath(".//td/span")[0].text, tr.xpath(".//td/span")[1].text, price])
-
- return values
-
-
-def get_tax_history(html: str):
- tree = lxml.html.fromstring(html)
- try:
- tbody = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//table//tbody")[1]
- except IndexError:
- return None
- values = []
- for tr in tbody.xpath(".//tr"):
- prop_tax = tr.xpath(".//td/span")[1].text.replace("$", "").replace(",", "").replace("-", "").strip()
- if prop_tax == '':
- prop_tax = None
- values.append([int(tr.xpath(".//td/span")[0].text), prop_tax, int(tr.xpath(".//td/span")[2].text.replace("$", "").replace(",", ""))])
-
- return values
-----
-====
-
-[TIP]
-====
-To test your code run the following.
-
-[source,python]
-----
-my_blob = asyncio.run(link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/"))
-
-def blob_to_html(blob: bytes) -> str:
- return blob.decode("utf-8")
-
-get_price_history(blob_to_html(my_blob))
-----
-
-.output
-----
-[['11/9/2022', 'Price change', '275000'],
- ['11/2/2022', 'Listed for sale', '289900'],
- ['1/13/2000', 'Sold', '19000']]
-----
-
-[source,python]
-----
-my_blob = asyncio.run(link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/"))
-
-def blob_to_html(blob: bytes) -> str:
- return blob.decode("utf-8")
-
-get_tax_history(blob_to_html(my_blob))
-----
-
-.output
-----
-[[2021, '1344', 124511],
- [2020, '1310', 122792],
- [2019, '1290', 120031],
- [2018, '1260', 117793],
- [2017, '1260', 115370],
- [2016, '1252', 112997],
- [2015, '1262', 112212],
- [2014, '1277', 113120],
- [2013, '1295', 112920],
- [2012, '1389', 124535],
- [2011, '1557', 134234],
- [2010, '1495', 132251],
- [2009, '1499', 128776],
- [2008, '1483', 128647],
- [2007, '1594', 124900],
- [2006, '1608', 121900],
- [2005, '1704', 118400],
- [2004, '1716', 115000],
- [2003, '1624', 112900],
- [2002, '1577', 110300],
- [2000, '288', 15700]]
-----
-
-Please note that exact numbers may change slightly; that is okay! Prices and things change.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Test out a `playwright` feature from the https://playwright.dev/python/docs/intro[documentation] that is new to you. This could be anything. One suggestion that could be interesting is screenshots. As long as you demonstrate _something_ new, you will receive credit for this question. Have fun, and happy Thanksgiving!
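-
-[TIP]
-====
-If you go the screenshot route, a minimal sketch looks something like the following (the file name is just a placeholder, and this is assumed to run inside your `async` function).
-
-[source,python]
-----
-# save a screenshot of the current page to a file
-await page.screenshot(path="example.png")
-----
-====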
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project13.adoc
deleted file mode 100644
index 2219f8119..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project13.adoc
+++ /dev/null
@@ -1,58 +0,0 @@
-= TDM 40100: Project 13 -- 2022
-
-**Motivation:** It has been a long semester! In this project, we want to give you some flexibility to explore and utilize some of the skills you've previously learned in the course. You will be given 4 options to choose from. Please note that we do not expect perfect submissions, but rather a strong effort in line with a typical project submission.
-
-**Context:** This is the final project for TDM 40100, where you will choose from 4 options that each exercise some skills from the semester and more.
-
-**Scope:** Python, sqlite3, playwright, selenium, pandas, matplotlib, and more.
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Questions
-
-Choose from one of the following options:
-
-=== Option 1
-
-Use the provided functions and your sqlite skills to scrape and store 1000+ homes in an area of your choice. Use the data you stored in your database to perform an analysis of your choice. Examples of potentially interesting questions you could ask:
-
-- What percentage of homes have "fishy" histories? For example, a home for sale on the market for too long is viewed as "bad". You may notice homes being marked as "sold" and immediately put back on the market. This refreshes Zillow's data and makes it look like the home is new to the market, when in fact it is not.
-- For your area, what is the average time on the market before the home is sold? What is the average price drop, and after how many days does the price drop occur?
-
-=== Option 2
-
-Use the provided functions and libraries like `argparse` or https://typer.tiangolo.com/[`typer`] to build a CLI to make Zillow queries and display data. Please incorporate at least 1 of the following "extra" features:
-
-- Color your output using `rich`.
-- Or containerize your application using https://docs.sylabs.io/guides/3.5/user-guide/build_a_container.html#building-containers-from-singularity-definition-files[singularity].
-- Or use `sqlite3` to cache the HTML blobs -- if the blob for a home or query is not older than 1 day, then use the cached version instead of making a new request.
-
-=== Option 3
-
-Abandon the housing altogether and instead have some FIFA fun. Scrape data from https://fbref.com/en/ and choose from two very similar projects.
-
-* Write `playwright` or `selenium` code to scrape data from https://fbref.com. Scrape 1000+ structured pieces of information and store it in a database to perform an analysis of your choice. Examples could be:
-** Can you find any patterns that may indicate promising players under the age of 21 by looking at currently successful players' data when they were young?
-** What country produces the most talent (by some metric you describe)?
-* Build a CLI to make queries and display data. Please incorporate at least 1 of the following "extra" features:
-** Color your output using `rich`.
-** Or containerize your application using https://docs.sylabs.io/guides/3.5/user-guide/build_a_container.html#building-containers-from-singularity-definition-files[singularity].
-** Or use `sqlite3` to cache the HTML blobs -- if the blob for a page or query is not older than 1 day, then use the cached version instead of making a new request.
-
-=== Option 4
-
-Have another idea that utilizes the same skillsets? Please post it in Piazza to get approval from Kevin or Dr. Ward.
-
-.Items to submit
-====
-- A markdown cell describing the option(s) you chose to complete for this project and why you chose it/them.
-- If you chose to scrape 1000+ bits of data, 2 SQL cells: 1 that demonstrates a sample of your data (for instance 5 rows printed out), and 1 that shows that you've scraped 1000+ records.
-- If you chose to scrape 1000+ bits of data, an analysis complete with your problem statement, how you chose to solve the problem, and your code and analysis, with at least 2 included visualizations, and a conclusion.
-- If you chose to build a CLI, a markdown cell describing the CLI and how to use it, and the options it has.
-- Screenshots demonstrating the capabilities of your CLI and the extra feature(s) you chose to implement.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-projects.adoc
deleted file mode 100644
index fdb4ca147..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-projects.adoc
+++ /dev/null
@@ -1,41 +0,0 @@
-= TDM 40100
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 13 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-[%header,format=csv,stripes=even,%autowidth.stretch]
-|===
-include::ROOT:example$40100-2022-projects.csv[]
-|===
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. See xref:current-projects:submissions.adoc[here] for more information.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2022/tdm40100[https://piazza.com/purdue/fall2022/tdm40100]
-
-=== Link
-
-https://piazza.com/purdue/fall2022/tdm40100/home[https://piazza.com/purdue/fall2022/tdm40100/home]
-
-== Syllabus
-
-See xref:fall2022/logistics/syllabus.adoc[here].
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/office_hours.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/office_hours.adoc
deleted file mode 100644
index 9fb1c4f44..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/office_hours.adoc
+++ /dev/null
@@ -1,34 +0,0 @@
-= Office Hours Fall 2022
-
-[IMPORTANT]
-====
-Check here to find the most up to date office hour schedule.
-====
-
-[NOTE]
-====
-**Office hours _during_ seminar:** Hillenbrand C141 -- the atrium inside the dining court +
-**Office hours _outside_ of seminar, before 5:00 PM EST:** Hillenbrand Lobby C100 -- the lobby between the 2 sets of front entrances +
-**Office hours _after_ 5:00 PM EST:** Online in Webex +
-**Office hours on the _weekend_:** Online in Webex
-====
-
-Navigate between tabs to view office hour schedules for each course and find Webex links to online office hours.
-
-
-== About the Office Hours in The Data Mine
-
-During Fall 2022, office hours will be in person in Hillenbrand Hall during popular on-campus hours, and online via Webex during later hours (starting at 5:00PM). Each TA holding an online office hour will have their own WebEx meeting setup, so students will need to click on the appropriate WebEx link to join office hours. In the meeting room, the student and the TA can share screens with each other and have vocal conversations, as well as typed chat conversations. You will need to use the computer audio feature, rather than calling in to the meeting. There is a WebEx app available for your phone, too, but it does not have as many features as the computer version.
-
-The priority is to have a well-staffed set of office hours that meets student traffic needs. **We aim to have office hours when students need them most.**
-
-Each online TA meeting will have a maximum of 7 other people able to join at one time. Students should enter the meeting room to ask their question, and when their question is answered, the student should leave the meeting room so that others can have a turn. Students are welcome to re-enter the meeting room when they have another question. If a TA meeting room is full, please wait a few minutes to try again, or try a different TA who has office hours at the same time.
-
-Students can also use Piazza to ask questions. The TAs will be monitoring Piazza during their office hours. TAs should try and help all students, regardless of course. If a TA is unable to help a student resolve an issue, the TA might help the student to identify an office hour with a TA that can help, or encourage the student to post in Piazza.
-
-The weekly projects are due on Friday evenings at 11:55 PM through Gradescope in Brightspace. All the seminar times are on Mondays. New projects are released on Thursdays, so students have 8 days to work on each project.
-
-All times listed are Purdue time (Eastern).
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/schedule.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/schedule.adoc
deleted file mode 100644
index 290329188..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/schedule.adoc
+++ /dev/null
@@ -1,118 +0,0 @@
-= Fall 2022 Course Schedule - Seminar
-
-Assignment due dates are listed in *BOLD*. Other dates are important notes.
-
-*Remember, only your top 10 out of 13 project scores are factored into your final grade.*
-
-[cols="^.^1,^.^3,<.^12"]
-|===
-
-|*Week* |*Date* ^.|*Activity*
-
-|1
-|8/22 - 8/26
-|Monday, 8/22: First day of fall 2022 classes
-
-
-
-|2
-|8/29 - 9/9
-|
-*Project #1 due on Gradescope by 11:59 PM ET on Friday, 9/9*
-
-*Syllabus Quiz due on Gradescope by 11:59 PM ET on Friday, 9/9*
-
-*Academic Integrity Quiz due on Gradescope by 11:59 PM ET on Friday, 9/9*
-
-*Project #2 due on Gradescope by 11:59 PM ET on Friday, 9/9*
-
-
-|3
-|9/5 - 9/9
-|Monday, 9/5: Labor Day, no classes
-
-
-
-|4
-|9/12 - 9/16
-|
-*Project #3 due on Gradescope by 11:59 PM ET on Friday, 9/16*
-
-
-
-|5
-|9/19 - 9/23
-|
-*Project #4 due on Gradescope by 11:59 PM ET on Friday, 9/23*
-
-*Outside Event #1 due on Gradescope by 11:59 PM ET on Friday, 9/23*
-
-
-|6
-|9/26 - 9/30
-| *Project #5 due on Gradescope by 11:59 PM ET on Friday, 9/30*
-
-
-|7
-|10/3 - 10/7
-|*Project #6 due on Gradescope by 11:59 PM ET on Friday, 10/7*
-
-
-|8
-|10/10 - 10/14
-|Monday & Tuesday, 10/10 - 10/11 October Break
-
-|9
-|10/17 - 10/21
-|
-*Project #7 due on Gradescope by 11:59 PM ET on Friday, 10/21*
-
-*Outside Event #2 due on Gradescope by 11:59 PM ET on Friday, 10/21*
-
-|10
-|10/24 - 10/28
-|
-*Project #8 due on Gradescope by 11:59 PM ET on Friday, 10/28*
-
-|11
-|10/31 - 11/4
-|
-*Project #9 due on Gradescope by 11:59 PM ET on Friday, 11/4*
-
-|12
-|11/7 - 11/11
-|
-*Project #10 due on Gradescope by 11:59 PM ET on Friday, 11/11*
-
-
-|13
-|11/14 - 11/18
-|
-*Project #11 due on Gradescope by 11:59 PM ET on Friday, 11/18*
-
-*Outside Event #3 due on Gradescope by 11:59 PM ET on Friday, 11/18*
-
-|14
-|11/21 - 11/25
-|Wednesday - Friday, 11/23 - 11/25: Thanksgiving Break
-
-
-|15
-|11/28 - 12/2
-|
-*Project #12 due on Gradescope by 11:59 PM ET on Friday, 12/2*
-
-|16
-|12/5 - 12/9
-|
-*Project #13 due on Gradescope by 11:59 PM ET on Friday, 12/9*
-
-|
-|12/12 - 12/16
-|Final Exam Week - There are no final exams in The Data Mine.
-
-
-|
-|12/20
-|Tuesday, 12/20: Fall 2022 grades are submitted to Registrar's Office by 5 PM Eastern
-
-
-|===
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus.adoc
deleted file mode 100644
index 08648bc66..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus.adoc
+++ /dev/null
@@ -1,269 +0,0 @@
-= Fall 2022 Syllabus - The Data Mine Seminar
-
-== Course Information
-
-
-[%header,format=csv,stripes=even]
-|===
-Course Number and Title, CRN
-TDM 10100 - The Data Mine I, possible CRNs are 12067 or 12072 or 12073 or 12071
-TDM 20100 - The Data Mine III, possible CRNs are 12117 or 12106 or 12113 or 12118
-TDM 30100 - The Data Mine V, possible CRNs are 12104 or 12112 or 12115 or 12120
-TDM 40100 - The Data Mine VII, possible CRNs are 12103 or 12111 or 12114 or 12119
-TDM 50100 - The Data Mine Seminar, CRN 15644
-|===
-
-*Course credit hours:* 1 credit hour, so you should expect to spend about 3 hours per week doing work
-for the class
-
-*Prerequisites:*
-None for TDM 10100. All students, regardless of background are welcome. Typically, students new to The Data Mine sign up for TDM 10100, students in their second, third, or fourth years of The Data Mine sign up for TDM 20100, TDM 30100, and TDM 40100, respectively. TDM 50100 is geared toward graduate students. However, during the first week of the semester (only), if a student new to The Data Mine has several years of data science experience and would prefer to switch from TDM 10100 to TDM 20100, we can make adjustments on an individual basis.
-
-=== Course Web Pages
-
-- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within
-- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope
-- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace.
-- link:https://datamine.purdue.edu[*The Data Mine's website*] - helpful resource
-- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*]
-
-=== Meeting Times
-There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium--no meal swipe required), and 4:30 pm (link:https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later). Attendance is not required.
-
-All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings, but this is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. The T.A.s will have many daytime and evening office hours throughout the week.
-
-=== Course Description
-
-The Data Mine is a supportive environment for students in any major from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits.
-
-=== Learning Outcomes
-
-By the end of the course, you will be able to:
-
-1. Discover data science and professional development opportunities in order to prepare for a career.
-2. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project.
-3. Design efficient search strategies in order to acquire new data science skills.
-4. Devise the most appropriate data science strategy in order to answer a research question.
-5. Apply data science techniques in order to answer a research question about a big data set.
-
-=== Required Materials
-
-* A laptop so that you can easily work with others. Having audio/video capabilities is useful.
-* Brightspace and Gradescope course pages.
-* Access to Jupyter Lab at the On Demand Gateway on Anvil:
-https://ondemand.anvil.rcac.purdue.edu/
-* "The Examples Book": https://the-examples-book.com
-* Good internet connection.
-
-=== Attendance Policy
-
-Attendance is not required.
-
-When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible.
-
-For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor or TA as soon as possible by email or phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, parenting leave, and medical excuse. For details, see the link:https://catalog.purdue.edu/content.php?catoid=13&navoid=15965#a-attendance[Academic Regulations & Student Conduct section] of the University Catalog website.
-
-== How to succeed in this course
-
-If you would like to be a successful Data Mine student:
-
-* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don't wait until the due date to start!
-* Be excited to challenge yourself and learn impressive new skills. Don't get discouraged if something is difficult--you're here because you want to learn, not because you already know everything!
-* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources.
-* Network! Get to know your classmates, even if you don't see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends!
-* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com
-* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more.
-* Don't forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason.
-* If you get behind or feel overwhelmed about this course or anything else, please talk to us!
-* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you
-should keep a copy of the course schedule where you see it easily.
-* Read your emails!
-
-== Information about the Instructors
-
-=== The Data Mine Staff
-
-[%header,format=csv]
-|===
-Name, Title
-Shared email we all read, datamine-help@purdue.edu
-Kevin Amstutz, Senior Data Scientist and Instruction Specialist
-Maggie Betz, Managing Director of Corporate Partnerships
-Shuennhau Chang, Corporate Partners Senior Manager
-David Glass, Managing Director of Data Science
-Kali Lacy, Associate Research Engineer
-Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator
-Kim Rechkemmer, Senior Program Administration Specialist
-Nick Rosenorn, Corporate Partners Technical Specialist
-Katie Sanders, Operations Manager
-Rebecca Sharples, Managing Director of Academic Programs & Outreach
-Dr. Mark Daniel Ward, Director
-
-|===
-
-The Data Mine Team uses a shared email which functions as a ticketing system. Using a shared email helps the team manage the influx of questions, better distribute questions across the team, and send out faster responses.
-
-*For the purposes of getting help with this 1-credit seminar class, your most important people are:*
-
-* *T.A.s*: Visit their xref:fall2022/logistics/office_hours.adoc[office hours] and use the link:https://piazza.com/[Piazza site]
-* *Mr. Kevin Amstutz*, Senior Data Scientist and Instruction Specialist - Piazza is preferred method of questions
-* *Dr. Mark Daniel Ward*, Director: Dr. Ward responds to questions on Piazza faster than by email
-
-
-=== Communication Guidance
-
-* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest response by using Piazza versus emailing us.
-* For general Data Mine questions, email datamine-help@purdue.edu
-* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be
-requested within 1 week of the grade being posted.
-
-
-=== Office Hours
-
-The xref:fall2022/logistics/office_hours.adoc[office hours schedule is posted here.]
-
-Office hours are held in person in the Hillenbrand lobby and on Zoom. Check the schedule to see the available times.
-
-Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes!
-
-
-== Assignments and Grades
-
-
-=== Course Schedule & Due Dates
-
-xref:fall2022/logistics/schedule.adoc[Click here to view the Fall 2022 Course Schedule]
-
-See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works:
-
-In the first week of the beginning of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz.
-
-Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the Friday at 11:55 pm Purdue West Lafayette (Eastern) time. You will need to do 3 Outside Event reflections.
-
-We will have 13 weekly projects available, but we only count your best 10. This means you could miss up to 3 projects due to illness or other reasons, and it won't hurt your grade. We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren't done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night causing unexpected behavior and long wait times. Try to start your projects on or before Monday each week to leave yourself time to ask questions.
-
-The Data Mine does not conduct or collect an assessment during the final exam period. Therefore, TDM Courses are not required to follow the Quiet Period in the link:https://catalog.purdue.edu/content.php?catoid=15&navoid=18634#academic-calendar[Academic Calendar].
-
-[cols="4,1"]
-|===
-
-|Projects (best 10 out of Projects #1-13) |86%
-|Outside event reflections (3 total) |12%
-|Academic Integrity Quiz |1%
-|Syllabus Quiz |1%
-|*Total* |*100%*
-
-|===
-
-
-=== Grading Scale
-In this class grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B+.
-
-* A: 100.000% - 90.000%
-* B: 89.999% - 80.000%
-* C: 79.999% - 70.000%
-* D: 69.999% - 60.000%
-* F: 59.999% - 0.000%
-
-
-=== Late Policy
-
-We generally do NOT accept late work. For the projects, we count only your best 10 out of 13, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can't do this if we are waiting for other students to turn their work in.
-
-
-=== Projects
-
-* The projects will help you achieve Learning Outcomes #2-5.
-* Each weekly programming project is worth 10 points.
-* There will be 13 projects available over the semester, and your best 10 will count.
-* The 3 project grades that are dropped could be from illnesses, absences, travel, family
-emergencies, or simply low scores. No excuses necessary.
-* No late work will be accepted, even if you are having technical difficulties, so do not work at the
-last minute.
-* There are many opportunities to get help throughout the week, either through Piazza or office
-hours. We're waiting for you! Ask questions!
-* Follow the instructions for how to submit your projects properly through Gradescope in
-Brightspace.
-* It is ok to get help from others or online, although it is important to document this help in the
-comment sections of your project submission. You need to say who helped you and how they
-helped you.
-* Each week, the project will be posted on the Thursday before the seminar, the project will be
-the topic of the seminar and any office hours that week, and then the project will be due by
-11:55 pm Eastern time on the following Friday. See the schedule for specific dates.
-* If you need to request a regrade on any part of your project, use the regrade request feature
-inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this).
-
-
-=== Outside Event Reflections
-
-* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity.
-* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. Feel free to complete them early.
-** Outside Event Reflections *must* be submitted within 1 week of attending the event or watching the recording.
-** At least one of these events should be on the topic of Professional Development (designated by "PD" on the schedule)
-* Find outside events posted on The Data Mine's website (https://datamine.purdue.edu/events/), which is updated frequently. Let us know about any good events you hear about.
-* Format of Outside Events:
-** Often in person so you can interact with the presenter!
-** Occasionally online and possibly recorded
-* Follow the instructions in Gradescope for writing and submitting these reflections.
-*** Name of the event and speaker
-*** The time and date of the event
-*** What was discussed at the event
-*** What you learned from it
-*** What new ideas you would like to explore as a result of what you learned at the event
-*** AND what question(s) you would like to ask the presenter if you met them at an after-presentation reception.
-* We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you.
-
-=== Academic Integrity
-
-Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by either link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted provides the greatest opportunity for the university to investigate the concern.
-
-In TDM 10100/20100/30100/40100/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and incidents of academic misconduct reported to the Office of The Dean of Students.
-
-*Good Collaboration:*
-
-* First try the project yourself, on your own.
-* After trying the project yourself, then get together with a small group of other students who
-have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA.
-* Finish the project on your own so that what you turn in truly represents your own understanding of the material.
-* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information.
-* If the assignment involves writing a long, worded explanation, you may proofread somebody's completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though.
-
-*Academic Misconduct:*
-
-* Divide up the problems among a group. (You do #1, I'll do #2, and he'll do #3: then we'll share our work to get the assignment done more quickly.)
-* Attend a group work session without having first worked all of the problems yourself.
-* Allowing your partners to do all of the work while you copy answers down, or allowing an
-unprepared partner to copy your answers.
-* Letting another student copy your work or doing the work for them.
-* Sharing files or typing on somebody else's computer or in their computing account.
-* Getting help from a classmate or a TA without documenting that help in the comments section.
-* Looking up a potential solution online without documenting that help in the comments section.
-* Reading someone else's answers before you have completed your work.
-* Have a tutor or TA work through all (or some) of your problems for you.
-* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites.
-* Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester.
-* Using somebody else's outside event reflection rather than attending the event yourself.
-
-The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge]: "As a boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue"
-
-Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details.
-
-=== Disclaimer
-This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated.
-
-== xref:fall2022/logistics/syllabus_purdue_policies.adoc[Purdue Policies & Resources]
-Includes:
-
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Academic Guidance in the Event a Student is Quarantined/Isolated[Academic Guidance in the Event a Student is Quarantined/Isolated]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Class Behavior[Class Behavior]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Nondiscrimination Statement[Nondiscrimination Statement]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Students with Disabilities[Students with Disabilities]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Mental Health Resources[Mental Health Resources]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Violent Behavior Policy[Violent Behavior Policy]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Diversity and Inclusion Statement[Diversity and Inclusion Statement]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Basic Needs Security Resources[Basic Needs Security Resources]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Course Evaluation[Course Evaluation]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#General Classroom Guidance Regarding Protect Purdue[General Classroom Guidance Regarding Protect Purdue]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Campus Emergencies[Campus Emergencies]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Illness and other student emergencies[Illness and other student emergencies]
-* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Disclaimer[Disclaimer]
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus_purdue_policies.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus_purdue_policies.adoc
deleted file mode 100644
index ed37cd3d7..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus_purdue_policies.adoc
+++ /dev/null
@@ -1,75 +0,0 @@
-== Purdue Policies & Resources
-
-=== Academic Guidance in the Event a Student is Quarantined/Isolated
-While everything we are doing in The Data Mine this semester can be done online, rather than in person, and no part of your seminar grade comes from attendance, we want to remind you of general campus attendance policies during COVID-19. Students should stay home and contact the Protect Purdue Health Center (496-INFO) if they feel ill, have any symptoms associated with COVID-19, or suspect they have been exposed to the virus. In the current context of COVID-19, in-person attendance will not be a factor in the final grades, but the student still needs to inform the instructor of any conflict that can be anticipated and will affect the submission of an assignment. Only the instructor can excuse a student from a course requirement or responsibility. When conflicts can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. For unanticipated or emergency conflict, when advance notification to an instructor is not possible, the student should contact the instructor as soon as possible by email or by phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor's department because of circumstances beyond the student's control, and in cases of bereavement, quarantine, or isolation, the student or the student's representative should contact the Office of the Dean of Students via email or phone at 765-494-1747. Below are links on Attendance and Grief Absence policies under the University Policies menu.
-
-If you must miss class at any point in time during the semester, please reach out to me via email so that we can communicate about how you can maintain your academic progress. If you find yourself too sick to progress in the course, notify your adviser and notify me via email or Brightspace. We will make arrangements based on your particular situation. Please note the link:https://protect.purdue.edu/updates/video-update-protect-purdue-fall-expectations/[Protect Purdue fall 2022 expectations] announced on the Protect Purdue website.
-
-=== Class Behavior
-
-You are expected to behave in a way that promotes a welcoming, inclusive, productive learning environment. You need to be prepared for your individual and group work each week, and you need to include everybody in your group in any discussions. Respond promptly to all communications and show up for any appointments that are scheduled. If your group is having trouble working well together, try hard to talk through the difficulties--this is an important skill to have for future professional experiences. If you are still having difficulties, ask The Data Mine staff to meet with your group.
-
-
-*Purdue's Copyrighted Materials Policy:*
-
-Among the materials that may be protected by copyright law are the lectures, notes, and other material presented in class or as part of the course. Always assume the materials presented by an instructor are protected by copyright unless the instructor has stated otherwise. Students enrolled in, and authorized visitors to, Purdue University courses are permitted to take notes, which they may use for individual/group study or for other non-commercial purposes reasonably arising from enrollment in the course or the University generally.
-Notes taken in class are, however, generally considered to be "derivative works" of the instructor's presentations and materials, and they are thus subject to the instructor's copyright in such presentations and materials. No individual is permitted to sell or otherwise barter notes, either to other students or to any commercial concern, for a course without the express written permission of the course instructor. To obtain permission to sell or barter notes, the individual wishing to sell or barter the notes must be registered in the course or must be an approved visitor to the class. Course instructors may choose to grant or not grant such permission at their own discretion, and may require a review of the notes prior to their being sold or bartered. If they do grant such permission, they may revoke it at any time, if they so choose.
-
-=== Nondiscrimination Statement
-Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. link:https://www.purdue.edu/purdue/ea_eou_statement.php[Link to Purdue's nondiscrimination policy statement.]
-
-=== Students with Disabilities
-Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: link:mailto:drc@purdue.edu[drc@purdue.edu] or by phone: 765-494-1247.
-
-If you have been certified by the Office of the Dean of Students as someone needing a course adaptation or accommodation because of a disability OR if you need special arrangements in case the building must be evacuated, please contact The Data Mine staff during the first week of classes. We are happy to help you.
-
-=== Mental Health Resources
-
-* *If you find yourself beginning to feel some stress, anxiety and/or feeling slightly overwhelmed,* try link:https://purdue.welltrack.com/[WellTrack]. Sign in and find information and tools at your fingertips, available to you at any time.
-* *If you need support and information about options and resources*, please contact or see the link:https://www.purdue.edu/odos/[Office of the Dean of Students]. Call 765-494-1747. Hours of operation are M-F, 8 am- 5 pm.
-* *If you find yourself struggling to find a healthy balance between academics, social life, stress*, etc. sign up for free one-on-one virtual or in-person sessions with a link:https://www.purdue.edu/recwell/fitness-wellness/wellness/one-on-one-coaching/wellness-coaching.php[Purdue Wellness Coach at RecWell]. Student coaches can help you navigate through barriers and challenges toward your goals throughout the semester. Sign up is completely free and can be done on BoilerConnect. If you have any questions, please contact Purdue Wellness at evans240@purdue.edu.
-* *If you're struggling and need mental health services:* Purdue University is committed to advancing the mental health and well-being of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of mental health support, services are available. For help, such individuals should contact link:https://www.purdue.edu/caps/[Counseling and Psychological Services (CAPS)] at 765-494-6995 during and after hours, on weekends and holidays, or by going to the CAPS office on the second floor of the Purdue University Student Health Center (PUSH) during business hours.
-
-=== Violent Behavior Policy
-
-Purdue University is committed to providing a safe and secure campus environment for members of the university community. Purdue strives to create an educational environment for students and a work environment for employees that promote educational and career goals. Violent Behavior impedes such goals. Therefore, Violent Behavior is prohibited in or on any University Facility or while participating in any university activity. See the link:https://www.purdue.edu/policies/facilities-safety/iva3.html[University's full violent behavior policy] for more detail.
-
-=== Diversity and Inclusion Statement
-
-In our discussions, structured and unstructured, we will explore a variety of challenging issues, which can help us enhance our understanding of different experiences and perspectives. This can be challenging, but in overcoming these challenges we find the greatest rewards. While we will design guidelines as a group, everyone should remember the following points:
-
-* We are all in the process of learning about others and their experiences. Please speak with me, anonymously if needed, if something has made you uncomfortable.
-* Intention and impact are not always aligned, and we should respect the impact something may have on someone even if it was not the speaker's intention.
-* We all come to class with a variety of experiences and a range of expertise; we should respect these in others while critically examining them in ourselves.
-
-=== Basic Needs Security Resources
-
-Any student who faces challenges securing their food or housing and believes this may affect their performance in the course is urged to contact the Dean of Students for support. No appointment is needed, and Student Support Services is available to serve students from 8:00 - 5:00, Monday through Friday. The link:https://www.purdue.edu/vpsl/leadership/About/ACE_Campus_Pantry.html[ACE Campus Food Pantry] is open to the entire Purdue community.
-
-Considering the significant disruptions caused by the current global crisis as it relates to COVID-19, students may submit requests for emergency assistance from the link:https://www.purdue.edu/odos/resources/critical-need-fund.html[Critical Needs Fund].
-
-=== Course Evaluation
-
-During the last two weeks of the semester, you will be provided with an opportunity to give anonymous feedback on this course and your instructor. Purdue uses an online course evaluation system. You will receive an official email from evaluation administrators with a link to the online evaluation site. You will have up to 10 days to complete this evaluation. Your participation is an integral part of this course, and your feedback is vital to improving education at Purdue University. I strongly urge you to participate in the evaluation system.
-
-You may email feedback to us anytime at link:mailto:datamine-help@purdue.edu[datamine-help@purdue.edu]. We take feedback from our students seriously, as we want to create the best learning experience for you!
-
-=== General Classroom Guidance Regarding Protect Purdue
-
-Any student who has substantial reason to believe that another person is threatening the safety of others by not complying with Protect Purdue protocols is encouraged to report the behavior to and discuss the next steps with their instructor. Students also have the option of reporting the behavior to the link:https://purdue.edu/odos/osrr/[Office of Student Rights and Responsibilities]. See also link:https://catalog.purdue.edu/content.php?catoid=7&navoid=2852#purdue-university-bill-of-student-rights[Purdue University Bill of Student Rights] and the Violent Behavior Policy under University Resources in Brightspace.
-
-=== Campus Emergencies
-
-In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. Here are ways to get information about changes in this course:
-
-* Brightspace or by e-mail from Data Mine staff.
-* General information about a campus emergency can be found on the Purdue website: https://www.purdue.edu[www.purdue.edu].
-
-
-=== Illness and other student emergencies
-
-Students with *extended* illnesses should contact their instructor as soon as possible so that arrangements can be made for keeping up with the course. Extended absences/illnesses/emergencies should also go through the Office of the Dean of Students.
-
-=== Disclaimer
-This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated.
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/ta_schedule.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/ta_schedule.adoc
deleted file mode 100644
index db47547ed..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/ta_schedule.adoc
+++ /dev/null
@@ -1,6 +0,0 @@
-= Seminar TA Fall 2022 Schedule
-
-++++
-
-++++
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project01.adoc
deleted file mode 100644
index e7e5bf012..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project01.adoc
+++ /dev/null
@@ -1,409 +0,0 @@
-= TDM 10100: Project 1 -- 2023
-
-**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code.
-
-[NOTE]
-====
-IDE stands for Integrated Development Environment: software that helps us program cleanly and efficiently.
-====
-
-**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data!
-
-**Scope:** R, Jupyter Lab, Anvil
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Anvil.
-- Read and write basic (.csv) data using R.
-****
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/1991.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/imdb.db`
-- `/anvil/projects/tdm/data/disney/flight_of_passage.csv`
-
-== Setting Up to Work
-
-++++
-
-++++
-
-
-This year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate and log in to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (including 2-factor authentication using Duo Mobile). You will be met with a screen with lots of options. Don't worry; the next steps are very straightforward.
-
-[TIP]
-====
-If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup
-====
-
-Towards the middle of the top menu, click on the item labeled btn:[My Interactive Sessions]. (Depending on the size of your browser window, there might only be an icon; it is immediately to the right of the menu item for The Data Mine.) On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, near the bottom of your screen, click on btn:[Jupyter Notebook]. (Make sure that you choose the Jupyter Notebook from "The Data Mine" section.)
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 1918 MB of memory.
-
-[NOTE]
-====
-As you can see in the screenshot above, each core is associated with 1918 MB of memory. If you know how much memory your project will need, you can use this value to choose how many cores you want. In this and most of the other projects in this class, 1-2 cores is generally enough.
-====
-
-[NOTE]
-====
-Please use 4 cores for this project. This is _almost always_ excessive, but for this project in question 3 you will be reading in a rather large dataset that will very likely crash your kernel without at least 3-4 cores.
-====
-
-We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine.
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on this button to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-seminar::
-The `seminar` kernel runs Python code but also has the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-seminar-r::
-The `seminar-r` kernel is intended for projects that **only** use R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the `seminar` kernel. Click on btn:[seminar], and a fresh notebook will be created for you.
-
-
-The first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`).
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain comments about your work).
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-[TIP]
-====
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-====
-
-
-== Questions
-
-=== Question 1 (1 pt)
-[upperalpha]
-.. How many cores and how much memory (in GB) does Anvil's sub-cluster A have? (0.5 pts)
-.. How many cores and how much memory (in GB) does your personal computer have? (0.5 pts)
-
-++++
-
-++++
-
-
-For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster].
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to manually calculate how many cores and how much memory is available for Anvil's "sub-cluster A".
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer.
-
-[TIP]
-====
-Information about the core and memory capacity of Anvil "sub-clusters" can be found https://www.rcac.purdue.edu/compute/anvil[here].
-
-Information about the core and memory capacity of your computer is typically found in the "About this PC" section of your computer's settings.
-====
-
-.Items to submit
-====
-- A sentence (in a markdown cell) explaining how many cores and how much memory is available to Anvil sub-cluster A.
-- A sentence (in a markdown cell) explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-=== Question 2 (2 pts)
-[upperalpha]
-.. Using Python, what is the name of the node on Anvil you are running on?
-.. Using Bash, what is the name of the node on Anvil you are running on?
-.. Using R, what is the name of the node on Anvil you are running on?
-
-++++
-
-++++
-
-Our next step will be to test out our connection to the Anvil Computing Cluster! Run the following code snippets in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on (in three different languages!). What is the name of the node on Anvil that you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[source,r]
-----
-%%R
-
-system("hostname", intern=TRUE)
-----
-
-[source,bash]
-----
-%%bash
-
-hostname
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-Check the results of each code snippet to ensure they all return the same hostname. Do they match? You may notice that `R` prints some extra "junk" output, while `bash` and `Python` do not. This is nothing to be concerned about. (Different languages have different types of output.)
-
-.Items to submit
-====
-- Code used to solve this problem, along with the output of running that code.
-====
-
-=== Question 3 (2 pts)
-[upperalpha]
-.. Run each of the example code snippets below, and include them and their output in your submission to get credit for this question.
-
-++++
-
-++++
-
-
-[TIP]
-====
-Remember, in the upper right-hand corner of your notebook you will see the current kernel for the notebook, `seminar`. If you click on this name you will have the option to swap kernels out -- no need to do this now, but it is good to know!
-====
-
-In this course, we will be using Jupyter Lab with multiple different languages. Often, we will center a project around a specific language and choose the kernel for that language appropriately, but occasionally we may need to run a language in a kernel other than the one it is primarily built for. The solution to this is using line magic!
-
-Line magic tells our code interpreter that we are using a language other than the default for our kernel (i.e. The `seminar` kernel we are currently using is expecting Python code, but we can tell it to expect R code instead.)
-
-Line magic works by having the very first line in a code cell formatted like so:
-
-`%%language`
-
-Where `language` is the language we want to use. For example, if we wanted to run R code in our `seminar` kernel, we would use the following line magic:
-
-`%%R`
-
-Practice running the following examples, which include line magic where needed.
-
-python::
-[source,python]
-----
-import pandas as pd
-df = pd.read_csv('/anvil/projects/tdm/data/flights/subset/1991.csv')
-----
-
-[source,python]
-----
-df[df["Month"]==12].head() # see information about a few of the flights from December 1991
-----
-
-SQL::
-[source, ipython]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-[source, sql]
-----
-%%sql
-
--- see information about a few TV episodes called "Finale"
-SELECT *
-FROM episodes AS e
-INNER JOIN titles AS t
-ON t.title_id = e.episode_title_id
-WHERE t.primary_title = 'Finale'
-LIMIT 5;
-----
-
-bash::
-[source,bash]
-----
-%%bash
-
-names="John Doe;Bill Withers;Arthur Morgan;Mary Jane;Rick Ross;John Marston"
-echo $names | cut -d ';' -f 3
-echo $names | cut -d ';' -f 6
-----
-
-[NOTE]
-====
-In the above examples you will see lines such as `%%R` or `%%sql`. These are called "Line Magic". They allow you to run non-Python code in the `seminar` kernel. In order for line magic to work, it MUST be on the first line of the code cell it is being used in (before any comments or any code in that cell).
-
-In the future, you will likely stick to using the kernel that matches the project language, but we wanted you to have a demonstration about "line magic" in Project 1. Line magic is a handy trick to know!
-
-To learn more about how to run various types of code using the `seminar` kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code from the examples above, and the outputs produced by running that code.
-====
-
-=== Question 4 (1 pt)
-[upperalpha]
-.. Using Python, calculate how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (0.5 pts)
-.. Using R, calculate how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (0.5 pts)
-
-
-++++
-
-++++
-
-
-[NOTE]
-====
-"Comments" are text in code cells that are not "run" as code. They serve as helpful notes on how your code works. Always comment your code well enough that you can come back to it after a long amount of time and understand what you wrote. In R and Python, single-line comments can be made by putting `#` at the beginning of the line you want commented out.
-====
-
-[NOTE]
-====
-Spacing in code is sometimes important, sometimes not. The two things you can do to find out what applies in your case are looking at documentation online and experimenting on your own, but we will also try to stress what spacing is mandatory and what is a style decision in our videos.
-====
-
-In question 1 we answered questions about cores and memory for the Anvil clusters. This time, we want you to convert your GB memory amount from question 1 into bytes and terabytes. Instead of using a calculator (or paper, or mental math for you good-at-mental-math folks), write these calculations using R _and_ Python, in separate code cells.
-
-[TIP]
-====
-A Gigabyte is 1,000,000,000 bytes.
-A Terabyte is 1,000 Gigabytes.
-====
-
-[TIP]
-====
-https://www.datamentor.io/r-programming/operator[This link] will point you to resources about how to use basic operators in R, and https://www.tutorialspoint.com/python/python_basic_operators.htm[this one] will teach you about basic operators in Python.
-====
-
-.Items to submit
-====
-- Python code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code.
-- R code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code.
-====
-
-=== Question 5 (2 pts)
-[upperalpha]
-.. Load the "flight_of_passage.csv" data into an R dataframe called "dat". (0.5 pts)
-.. Take the head of "dat" to ensure your data loaded in correctly. (0.5 pts)
-.. Change the name of "dat" to "flight_of_passage", remove the reference to "dat", and then take the head of "flight_of_passage" in order to ensure that your actions were successful. (1 pt)
-
-
-++++
-
-++++
-
-
-In the previous question, we ran our first R and Python code (aside from _provided_ code). In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code.
-
-[source,ipython]
-----
-%%R
-
-dat <- read.csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv")
-----
-
-Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. By default, the `head` function returns the first 6 rows of the dataset.
-
-[source,r]
-----
-%%R
-
-head(dat)
-----
-
-[IMPORTANT]
-====
-Remember -- if you are in a _new_ code cell in the `seminar` kernel, you'll need to add `%%R` to the top of the code cell; otherwise, Jupyter will try to run your R code using the _Python_ interpreter -- that would be no good!
-====
-
-`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`.
-
-Run our code to read in our dataset again, but this time, instead of naming the resulting dataset `dat`, name it `flight_of_passage`. Place all of your code into a new cell. Be sure there is a level 2 header titled "Question 5" above your code cell.
-
-[TIP]
-====
-In markdown, a level 2 header is any line starting with 2 hashtags. For example, `Question X` with two hashtags beforehand is a level 2 header. When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here].
-====
-
-[NOTE]
-====
-We didn't need to re-read in our data in this question to make our dataset be named `flight_of_passage`. We could have re-named `dat` to be `flight_of_passage` like this.
-
-[source,r]
-----
-flight_of_passage <- dat
-----
-
-Some of you may think that this isn't exactly what we want, because it looks like we are copying our dataset. You are right, that would certainly _not_ be what we want! If it were a 5 GB dataset, that would be a lot of wasted space! Fortunately, R uses copy-on-modify semantics: until you modify either `dat` or `flight_of_passage`, the dataset isn't actually copied. You can therefore run the following code to remove the other reference to our dataset.
-
-[source,r]
-----
-rm(dat)
-----
-====
-
-.Items to submit
-====
-- Code to load the data into a dataframe called `dat` and take the head of that data, and the output of that code.
-- Code to change the name of `dat` to `flight_of_passage` and remove the variable `dat`, and to take the head of `flight_of_passage` to ensure the name-change worked.
-====
-
-=== Submitting your Work
-
-
-++++
-
-++++
-
-
-Congratulations, you just finished your first assignment for this class! Now that we've written some code and added some markdown cells to explain what we did, we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project.
-
-We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading.
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
-
-You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.
-====
-
-A `.ipynb` file is generated by first running every cell in the notebook (which can be done quickly by pressing the "double play" button along the top of the page), and then clicking the "Download" button from menu:File[Download].
-
-In addition to the `.ipynb` file, an additional file should be included for each programming language in the project, containing all of the code from that language that is in the project. A full list of files required for the submission will be listed at the bottom of the project page.
-
-Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Do the same for each programming language, and ensure that all files in the submission requirements below are included. Once complete, submit all files as named and listed below to Gradescope.
-
-.Items to submit
-====
-- `firstname-lastname-project01.ipynb`.
-- `firstname-lastname-project01.R`.
-- `firstname-lastname-project01.py`.
-- `firstname-lastname-project01.sql`.
-- `firstname-lastname-project01.sh`.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
-Here is the Zoom recording of the 4:30 PM discussion with students from 21 August 2023:
-
-++++
-
-++++
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project02.adoc
deleted file mode 100644
index edaa5f3d4..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project02.adoc
+++ /dev/null
@@ -1,301 +0,0 @@
-= TDM 10100: Project 2 -- 2023
-Introduction to R part I
-
-In this project we will dive in head-first and learn some of the basics while solving data-driven problems.
-
-
-[NOTE]
-====
-**5 Basic Types of Data**
-
- * Values like 1.5 are called numeric values, real numbers, decimal numbers, etc.
- * Values like 7 are called integers or whole numbers.
- * Values TRUE or FALSE are called logical values or Boolean values.
- * Texts consist of sequences of words (also called strings), and words consist of sequences of characters.
- * Values such as 3 + 2ifootnote:[https://stat.ethz.ch/R-manual/R-devel/library/base/html/complex.html] are called complex numbers. We usually do not encounter these in The Data Mine.
-====
-
-
-
-[NOTE]
-====
-R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes.
-
-In addition, there are many other languages and tools, e.g., https://julialang.org/[Julia] and https://www.rust-lang.org/[Rust] and https://go.dev/[Go] and many other languages are emerging as relatively newer languages that each have their own advantages.
-====
-
-**Context:** In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples.
-
-In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis.
-
-**Scope:** xref:programming-languages:R:index.adoc[R], xref:programming-languages:R:lists-and-vectors.adoc[vectors, lists], https://rspatial.org/intr/4-indexing.html[indexing]
-
-.Learning Objectives
-****
-- Be aware of the different concepts, such as lists, vectors, factors, and data.frames, and when to apply them.
-- Be able to explain and demonstrate: positional, named, and logical indexing.
-- Read and write basic (csv) data using R.
-- Identify good and bad aspects of simple plots.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset:
-
-- `/anvil/projects/tdm/data/flights/subset/1995.csv`
-
-== Questions
-
-=== Question 1 (1 pt)
-[upperalpha]
-.. How many columns does this data frame have? (0.25 pts)
-.. How many rows does this data frame have? (0.25 pts)
-.. What type/s of data are in this data frame (example: numerical values, and/or text strings, etc.) (0.5 pts)
-
-[TIP]
-====
-"Kernel died" is a common error you could encounter during the semester. If you get a pop-up that says your "Kernel Died," it typically means that either 1) Anvil is down. Be sure to check your email and Piazza for updates, or 2) You need more cores for your project. Try starting a new session with an additional core. If you are using more than 4 cores, the problem is NOT this.
-====
-[TIP]
-====
-For this project, you will probably need to reserve 2-3 cores. Also, remember to use the `seminar-r` kernel going forward in this class.
-====
-
-
-++++
-
-++++
-
-
-It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems.
-
-We are going to use the `read.csv()` function to load our datasets into a data frame.
-
-To read data in R from a CSV file (.csv), you use the following command:
-
-[source,r]
-
-----
-myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv")
-----
-
-[TIP]
-====
-R is a case-sensitive language, so if you try and take the head of `mydf` instead of `myDF`, it will not work.
-====
-
-[NOTE]
-====
-Here `myDF` is a variable - a name that references our data frame. In practice, you should always use names that are specific, descriptive, and meaningful.
-====
-
-We want to use functions such as `head`, `tail`, `dim`, `summary`, `str`, `class`, to get a better understanding of our data frame.
-
-[TIP]
-====
-- `head(myDF)` - Look at the head (or top) of the data frame
-
-- `tail(myDF)` - Look at the tail (or bottom) of the data frame
-
-- `class(myDF$Dest)` - Return the type of data in a column of the data frame, for instance, in a column that stores the destination of flights (Dest)
-
-- Try and figure out `dim`, `summary`, and `str` on your own, but we give some details about them in the video as well.
-====
-
-.Items to submit
-====
-- Code used to solve sub-questions A, B, and C, and the output from running that code.
-- The number of columns and rows in the data frame, in a markdown cell.
-- A list of all of the types of data present in the data frame, in a markdown cell.
-====
-
-=== Question 2 (1 pt)
-[upperalpha]
-.. What type of data is in the vector `myairports`? (0.5 pts)
-.. The vector `myairports` contains all of the airports where flights departed from in 1995. Print the first 250 of those airports. (Do not print all of the airports, because there are 5327435 such values!) How many of the first 250 flights departed from O'Hare? (0.5 pts)
-
-
-++++
-
-++++
-
-
-[NOTE]
-====
-A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc.
-====
-
-Let's create a new https://sudo-labs.github.io/r-data-science/vectors/[vector] called `myairports` containing all of the origin airports (i.e., the airports where the flights departed) from the column `myDF$Origin` of the data frame `myDF`. We can do this using the `$` operator. Documentation on the `$` operator can be found https://statisticsglobe.com/meaning-of-dollar-operator-in-r[here], and an example of how to use it is given below.
-
-[source,r]
-----
-newVector <- myDF$ColumnName
-
-# to generate our vector, this would look like
-myairports <- myDF$Origin
-----
-
-[TIP]
-====
-The `head()` function may help you with part B of this question.
-====
-
-.Items to submit
-====
-- Code used to create `myairports` and to solve the above sub-questions, and the output from running that code.
-- The type of data in your `myairports` vector in a markdown cell.
-- The number of flights that are from O'Hare in the first 250 entries of your `myairports` vector, in a markdown cell.
-====
-
-=== Question 3 (2 pts)
-
-[upperalpha]
-.. How many flights departed from Indianapolis (`IND`) in 1995? How many flights landed in Indianapolis (`IND`) in 1995? (1 pt)
-.. Consider the flight data from row 894 the data frame. What airport did it depart from? Where did it arrive? (0.5 pts)
-.. How many flights have a distance of less than 200 miles? (0.5 pts)
-
-
-++++
-
-++++
-
-
-There are many different ways to access data after we load it, and each has its own use case. One of the most common ways to access data is called _indexing_. Indexing is a way of selecting or excluding specific elements in our data. This is best shown through examples, some of which can be found https://rspatial.org/intr/4-indexing.html[here].
-
-[NOTE]
-====
-Accessing data can be done in many ways; one of those ways is called **_indexing_**. Typically we use brackets **[ ]** when indexing. By doing this we can select or even exclude specific elements. For example, we can select a specific column and a certain range within the column. Some examples of symbols to help us select elements include: +
- * < less than +
- * > greater than +
- * \<= less than or equal to +
- * >= greater than or equal to +
- * == is equal +
- * != is not equal +
-====
-
-[NOTE]
-====
-Many programming languages, such as https://www.python.org/[Python] and https://www.learn-c.org/[C], are called "zero-indexed". This means that they begin counting from '0' instead of '1'. Because R is not zero-indexed, we can count like humans normally do. In other words, R starts numbering with row '1'.
-====
-
-.Helpful Examples
-====
-[source,r]
-----
-# get all of the data between rows "row_index_start" and "row_index_end"
-myDF[row_index_start:row_index_end, ]
-
-# get all of the data from row 3 of myDF
-myDF[3,]
-
-# get all of the data from column 5 of myDF
-myDF[,5]
-
-# get every row of data in the columns between
-# myfirstcolumn and mylastcolumn
-myDF[,myfirstcolumn:mylastcolumn]
-
-
-# get the first 250 values from column 17
-head(myDF[,17], n=250)
-
-# retrieves all rows with Distances greater than 100
-myDF$Distance[myDF$Distance > 100]
-
-# retrieve all flights with Origin equal to "ORD"
-myDF$Origin[myDF$Origin == "ORD"]
-----
-====
-
-.Items to submit
-====
-- Code used to solve each sub-question above, and the output from running it.
-- The number of flights that departed from Indianapolis in our data, in a markdown cell.
-- The number of flights that landed in Indianapolis in our data, in a markdown cell.
-- The origin and destination airport from row 894 of the data frame, in a markdown cell.
-- The number of flights that have distances less than 200 miles, in a markdown cell.
-====
-
-=== Question 4 (2 pts)
-[upperalpha]
-.. Rank the airline companies (in the column `myDF$UniqueCarrier`) according to their popularity, (i.e. according to the number of flights on each airline). (1 pt)
-.. Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results? (1 pt)
-
-
-++++
-
-++++
-
-
-Oftentimes we will be dealing with enormous quantities of data, and it just isn't feasible to try and look at the data point-by-point in order to summarize the entire data frame. When we find ourselves in a situation like this, the `table()` function is here to save the day!
-
-Take a look at https://www.geeksforgeeks.org/create-table-from-dataframe-in-r/[this link] for some examples of how to use the `table()` function in R. Once you have a good understanding of how it works, try to answer the two sub-questions below using the `table()` function. You may need to use some other basic R functions as well.
-
-[NOTE]
-====
-It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values. This is known as "nesting" functions, and is common throughout programming.
-
-====
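-
-As a small, made-up illustration of nesting (deliberately not using the flight data), the snippet below tabulates a short vector, sorts the counts, and then takes the two largest:
-
-[source,r]
-----
-# made-up example: the two most common values in a small vector, with their counts
-pets <- c("cat", "dog", "cat", "fish", "dog", "cat")
-tail(sort(table(pets)), n = 2)
-----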
-
-.Items to submit
-====
-- Code used to solve the sub-questions above, and the output from running it.
-- The airline company codes in order of popularity, in a markdown cell.
-- The ten airplane tail codes with the most flights in our data, ordered from most flights to least flights, in a markdown cell.
-====
-
-=== Question 5 (2 pts)
-[upperalpha]
-.. Using the R built-in function `hist()`, create a histogram of flight distances. Make sure your plot has an appropriate title and labelled axes for full credit. (1 pt)
-.. Write 2-3 sentences detailing any patterns you see in your plot and what those patterns tell you about the distance of flights in this dataset. (1 pt)
-
-++++
-
-++++
-
-Graphs are a very important tool in analyzing data. By visualizing our data in any of a number of ways, we can discover patterns that may not be as readily apparent by simply looking at tables. As such, visualization is a vital part of every data scientist's skillset.
-
-In this question, we would like you to get comfortable with plotting in R. There are a number of built in tools for basic plotting in this language, but we will focus on histograms here. Using the `Distance` column of our data frame, create a histogram of the distribution of distances for our data. Then, write a few sentences describing your plot, any patterns you see, and what the distribution as a whole looks like.
-
-[TIP]
-====
-https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html[Documentation on R histograms] may help you understand how to complete this question.
-====
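-
-If you are unsure where to start, here is a minimal sketch of a titled, labelled histogram using a small made-up vector; your plot should instead use the `Distance` column of the data frame:
-
-[source,r]
-----
-# made-up values, for illustration only
-myvalues <- c(120, 250, 300, 450, 520, 610, 700, 880, 950, 1500)
-hist(myvalues,
-     main = "Distribution of example values",
-     xlab = "Value",
-     ylab = "Frequency")
-----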
-
-.Items to submit
-====
-- Code used to generate your histogram.
-- A histogram of the distances of flights in our data with a title and labelled axes.
-- 2-3 sentences about the patterns in the data, and what those patterns tell you about the greater data, in a markdown cell.
-====
-
-=== Submitting your Work
-Congratulations, you've finished Project 2! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.
-
-.Items to submit
-====
-- `firstname-lastname-project02.ipynb`.
-- `firstname-lastname-project02.R`.
-====
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
-
-You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
-Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023:
-
-++++
-
-++++
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project03.adoc
deleted file mode 100644
index b47bb9812..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project03.adoc
+++ /dev/null
@@ -1,277 +0,0 @@
-= TDM 10100: Project 3 -- Fall 2023
-Introduction to R part II
-
-**Motivation:** `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.
-
-**Context:** In Project 2 we ran our first R code, learned about vectors and indexing, and explored some basic functions in R. In this project, we will continue to reinforce what we've already learned and learn more about how dataframes, formally called `data.frame`, work in R.
-
-**Scope:** R, data.frames, factors
-
-.Learning Objectives
-****
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/craigslist/vehicles.csv`
-
-== Setting Up
-First, let's take a look at all of the data available to students. In order to do this, we are going to use a new function as listed below to list all of the files in the craigslist folder.
-
-Let's run the below command using the *seminar-r* kernel to view all the files in the folder.
-
-[source,r]
-----
-list.files("/anvil/projects/tdm/data/craigslist")
-----
-
-
-As you can see, we have two different files worth of information from Craigslist.
-For this project, we are interested in looking at the `vehicles.csv` file.
-
-++++
-
-++++
-
-
-Before we read in the data, we should check the size of the file to get an idea of how big it is. This is important because if the file is too large, we may need more cores for our project, or else our kernel will 'die'.
-
-We can check the size of our file (in bytes) using the following command.
-[source,r]
-----
-file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
-----
-
-[TIP]
-====
-You can also use `file.info` to see other information about the file.
-
-*size*- double: File size in bytes. +
-*isdir*- logical: Is the file a directory? +
-*mode*- integer of class "octmode". The file permissions, printed in octal, for example 644. +
-*mtime, ctime, atime*- integer of class "POSIXct": file modification, ‘last status change’ and last access times. +
-*uid*- integer: the user ID of the file's owner. +
-*gid*- integer: the group ID of the file's group. +
-*uname*- character: uid interpreted as a user name. +
-*grname* - character: gid interpreted as a group name. +
-(Unknown user and group names will be NA.)
-====
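-
-For instance (as a quick sanity check, not a required step), you can convert the reported size from bytes to gigabytes by dividing by 1,000,000,000:
-
-[source,r]
-----
-size_in_bytes <- file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
-size_in_bytes / 1000000000   # approximate size in GB
-----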
-
-Now that we have made sure our file isn't too big (1.44 GB), let's read it into a dataframe in the same way that we have done in the previous two projects.
-
-[TIP]
-====
-We recommend using 2 cores for your Jupyter Lab session this week.
-====
-
-Now we can read in the data and get started with our analysis.
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv")
-----
-
-== Questions
-
-=== Question 1 (1 pt)
-
-++++
-
-++++
-
-[upperalpha]
-.. How many rows and columns does our dataframe have?
-.. What type/s of data are in this dataframe (example: numerical values, and/or text strings, etc.)
-.. 1-2 sentences giving an overall description of our data.
-
-As we stressed in Project 2, familiarizing yourself with the data you are going to work with is an important first step. For this question, we want to figure out how many rows and columns are in our data along with what the types of data are in our data frame. The hint below contains all of the functions that we need to solve this problem. (We also covered these functions in detail in Project 2, so feel free to reference the previous project if you want more information.)
-
-When answering sub-question C., consider talking about where the data appears to be taken from, what the data contains, and any important details that immediately stand out to you about the data.
-
-[TIP]
-====
-The `head()`, `dim()`, and `str()` functions could be helpful in answering this question.
-====
-
-.Items to submit
-====
-- The number of rows and columns in our dataframe, in a markdown cell.
-- The types of data in our dataframe, in a markdown cell.
-- 1-2 sentences summarizing our data.
-====
-
-=== Question 2 (1 pt)
-
-++++
-
-++++
-
-++++
-
-++++
-
-[upperalpha]
-.. Print the number of NA values in the *'year'* column of `myDF`, and the percentage of the total number of rows in `myDF` that this represents.
-.. Create a new data frame called `goodyearsDF` with only the rows of `myDF` that have a defined `year` (non `NA` values). Print the `head` of this new data frame.
-.. Create a new data frame called `missingyearsDF` with only the rows of `myDF` that *are* missing data in the `year` column. Print the `head` of this new data frame.
-
-Now that we have a better understanding of the general structure and contents of our data, let's focus on some specific patterns in our data that may make analysis more challenging.
-
-Often, one of these patterns is missing data. This can come in many forms, such as NA, NaN, NULL, or simply a blank space in one of our dataframes cells. When performing data analysis, it is important to consider missing data and decide how to handle it appropriately.
-
-In this question, we will look at filtering out rows with missing data. The `R` function `is.na()` returns `TRUE` or `FALSE` according to whether the corresponding data is missing or not missing (respectively). An exclamation mark changes `TRUE` to `FALSE` and changes `FALSE` to `TRUE`. For this reason, `!is.na()` indicates which data are not `NA` values, in other words, which data are not missing. As an example, if we wanted to create a new dataframe with all of the rows that are not missing the latitude values, we could do any of the following equivalent methods:
-
-[source,r]
-----
-goodlatitudeDF <- subset(myDF, !is.na(myDF$lat))
-goodlatitudeDF <- subset(myDF, !is.na(lat))
-goodlatitudeDF <- myDF[!is.na(myDF$lat), ]
-----
-
-In the second method, the `subset` function knows that we are working with `myDF`, so we do not need to specify that `lat` is the latitude column in the `myDF` data frame, and instead, we can just refer to `lat` and the `subset` function knows that we are referring to a column.
-
-In the third method, when we write `myDF[ , ]` we put things before the comma that are conditions on the rows, and we put things after the comma that are conditions on the columns. So we are saying that we want rows of `myDF` for which the `lat` values are not `NA`, and we want all of the columns of `myDF`.
-
-If we compare the sizes of the original data frame and this new data frame, we can see that some rows were removed.
-
-[source,r]
-----
-dim(myDF)
-----
-
-[source,r]
-----
-dim(goodlatitudeDF)
-----
-
-To answer question 2, we want you to work (instead) with the `year` column, and try the same things that we demonstrated above from the `lat` column. We were simply giving you examples using the `lat` column, so that you have an example about how to deal with missing data in the `year` column.
-
-
-.Items to submit
-====
-- The number of NA values in the `year` column of `myDF` and the percentage of the total number of rows in `myDF` that this represents, in a markdown cell.
-- A dataframe called `goodyearsDF` containing only the rows in myDF that have a defined `year` (non NA values), and print the `head` of that data frame.
-- A dataframe called `missingyearsDF` containing only the rows in myDF that are missing the `year` data, and print the `head` of that data frame.
-====
-
-=== Question 3 (2 pts)
-
-++++
-
-++++
-
-++++
-
-++++
-
-[IMPORTANT]
-====
-Use the `myDF` data.frame for this question.
-====
-
-[upperalpha]
-.. Print the mean price of vehicles by `year` during the last 20 years.
-.. Find which `year` of vehicle appears most frequently in our data, and how frequently it occurs.
-
-
-[TIP]
-====
-Using the `aggregate` function is one possible way to solve this problem. An example of finding the mean `price` for each `type` of car is shown here:
-
-[source,r]
-----
-aggregate(price ~ type, data = myDF, FUN = mean)
-----
-====
-
-We want you to (instead) find the mean `price` for cars by `year`.
-
-[TIP]
-====
-Finding the most frequent value in our data can be done using `table`, which we have talked about previously, in conjunction with the `which.max` function. An example of finding the most frequent type of car is shown here:
-
-[source,r]
-----
-which.max(table(myDF$type))
-----
-====
-
-Now we want you to (instead) find the year in which the most cars appear in the data set.
-
-.Items to submit
-====
-- The mean price of each year of vehicle for the last 20 years, in a markdown cell.
-- The most frequent year in our data, and how frequently it occurred.
-====
-
-=== Question 4 (2 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. Among the `region_url` values in the data set, which `region_url` is most popular?
-.. What are the three most popular states, in terms of the number of craigslist listings that appear?
-
-Use the `table`, `sort`, and `tail` commands to find the most popular `region_url` and the most popular three states.
-
-(These two questions are not related to each other. In other words, when you look for the three states that appear most frequently, they have nothing at all to do with the region_url that you found.)
-
-.Items to submit
-====
-- The most popular `region_url`.
-- The three states that appear most frequently.
-====
-
-
-=== Question 5 (2 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. In question 3, we found the average price of vehicles by year. ("Average" and "mean" are two different words for the very same concept.) Choose at least two different plot types in R, and create two plots that show the average vehicle price by year.
-.. Write 3-5 sentences detailing any patterns present in the data along with your personal observations. (i.e. shape, outliers, etc.)
-
-[NOTE]
-====
-Remember, all plots should have a title and appropriate axis labels. Axes should also be scaled appropriately. It is also necessary to explain your plot using a few sentences.
-====
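-
-If you are unsure which plot types to try, the sketch below shows two possibilities (a line-and-point plot and a bar plot) drawn from made-up yearly averages; your plots should instead use the averages you computed in question 3:
-
-[source,r]
-----
-# made-up yearly averages, for illustration only
-years <- 2004:2013
-avg_price <- c(3500, 4200, 4800, 5100, 5600, 6000, 6900, 7400, 8200, 9000)
-plot(years, avg_price, type = "b",
-     main = "Average price by year (example data)",
-     xlab = "Year", ylab = "Average price (USD)")
-barplot(avg_price, names.arg = years,
-        main = "Average price by year (example data)",
-        xlab = "Year", ylab = "Average price (USD)")
-----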
-
-.Items to submit
-====
-- 2 different plots of average price of vehicle by year.
-- A 3-5 sentence explanation of any patterns present in the data along with your personal observations.
-====
-
-=== Submitting your Work
-Nice work, you've finished Project 3! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.
-
-.Items to submit
-====
-- `firstname-lastname-project03.ipynb`.
-- `firstname-lastname-project03.R`.
-====
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
-
-You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project04.adoc
deleted file mode 100644
index 7e3294b70..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project04.adoc
+++ /dev/null
@@ -1,273 +0,0 @@
-= TDM 10100: Project 4 -- Fall 2023
-Introduction to R part III
-
-
-Many data science tools, including `R`, have powerful ways to index data.
-
-[NOTE]
-====
-R typically has operations that are vectorized and there is little to no need to write loops. +
-R typically also uses indexing instead of using an if statement.
-
-* Sequential statements are executed one after another (in order), e.g.: +
-1. print line 45 +
-2. print line 15 +
-
-**if/else statements**
- create an order of direction based on a logical condition. +
-
-if statement example:
-[source,r]
-----
-x <- 7
-if (x > 0) {
-    print("Positive number")
-}
-----
-else statement example:
-[source,r]
-----
-x <- -10
-if (x >= 0) {
-    print("Non-negative number")
-} else {
-    print("Negative number")
-}
-----
-In `R`, we can classify many numbers all at once:
-[source,r]
-----
-x <- c(-10,3,1,-6,19,-3,12,-1)
-mysigns <- rep("Non-negative number", times=8)
-mysigns[x < 0] <- "Negative number"
-mysigns
-----
-
-====
-
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-
-**Context:** As we continue to become more familiar with `R`, this project will help reinforce the many ways of indexing data in `R`.
-
-**Scope:** R, data.frames, indexing.
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-
-Using the *seminar-r* kernel, let's first see all of the files that are in the `craigslist` folder:
-[source,r]
-----
-list.files("/anvil/projects/tdm/data/craigslist")
-----
-
-[NOTE]
-
-====
-Remember: +
-
-* If we want to see the file size (i.e., how large the CSV file is):
-[source,r]
-----
-file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
-----
-
-* You can also use 'file.info' to see other information about the file.
-====
-
-After looking at several of the files, we will go ahead and read in the vehicles data as a data frame.
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)
-----
-
-It is important that, each time we look at data, we start by becoming familiar with the contents of the data. +
-In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.
-
-This dataset has 25 columns. We are unable to see them all without adjusting the display width. We can do this by running:
-[source,r]
-----
-options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)
-----
-and we also remember (from the previous project) that we can set the output in `R` to look more natural this way:
-[source,r]
-----
-options(jupyter.rich_display = F)
-----
-
-
-[TIP]
-====
-- `head(myDF)` - Look at the first 6 rows
-- `tail(myDF)` - Look at the last 6 rows
-- `str(myDF)` - Check the structure of the data frame
-- `dim(myDF)` - Check the dimensions of the data frame
-
-To sort and order a single vector you can use this code:
-[source,r]
-----
-head(myDF$year[order(myDF$year)])
-----
-You can also use the `sort` function. By default, it sorts in ascending order. If you want the order to be descending, use `decreasing = TRUE` as an argument:
-[source,r]
-----
-head(sort(myDF$year, decreasing = TRUE))
-----
-====
-
-_**vectorization**_
-
-Most of R's functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the `[]` symbol for indexing.
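-
-For example (a small sketch with made-up numbers):
-
-[source,r]
-----
-x <- c(1, 4, 9, 16)
-sqrt(x)     # the function is applied to every element: 1 2 3 4
-x * 10      # arithmetic is also vectorized: 10 40 90 160
-x[x > 5]    # indexing with [] and a logical condition: 9 16
-----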
-
-[NOTE]
-====
-[source,r]
-----
-cut(myvector, breaks = c(-Inf, 10, 50, 200, Inf), labels = c("a", "b", "c", "d"), right = FALSE)
-----
-
-The `breaks` argument divides the range of `myvector` into the following intervals (with `right = FALSE`, each interval is closed on the left and open on the right; by default, `cut` instead closes intervals on the right):
-
-- (-∞, 10)
-- [10, 50)
-- [50, 200)
-- [200, ∞)
-
-The `labels` are then assigned as follows:
-
-- Values less than 10 are labeled "a".
-- Values in the range [10, 50) are labeled "b".
-- Values in the range [50, 200) are labeled "c".
-- Values 200 and above are labeled "d".
-====
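-
-As a quick, made-up illustration of how this behaves:
-
-[source,r]
-----
-# classify a handful of example values into the buckets described above
-myvector <- c(5, 42, 120, 950)
-cut(myvector, breaks = c(-Inf, 10, 50, 200, Inf),
-    labels = c("a", "b", "c", "d"), right = FALSE)
-# expected result: a b c d
-----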
-
-== Questions
-
-=== Question 1 (1.5 pts)
-
-++++
-
-++++
-
-
-[upperalpha]
-.. How many unique states are there in total? Which five of the states have the most occurrences?
-.. How many cars have a price that is greater than or equal to $2000?
-.. What is the average price of the vehicles in the dataset?
-
-
-=== Question 2 (1.5 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. Create a new column `mileage_category` in your data.frame that categorizes each vehicle's mileage into different buckets, by using the `cut` function on the `odometer` column.
-... "Low": [0, 50000)
-... "Moderate": [50000, 100000)
-... "High": [100000, 150000)
-... "Very High": [150000, Inf)
-
-.. Create a new column called `has_VIN` that flags whether or not the listed vehicle has a VIN provided.
-
-.. Create a new column called `description_length` to categorize listings based on the length of their descriptions (in terms of the number of characters).
-... "Very Short": [0, 50)
-... "Short": [50, 100)
-... "Medium": [100, 200)
-... "Long": [200, 500)
-... "Very Long": [500, Inf)
-
-[TIP]
-====
-You may count the number of characters using the `nchar` function:
-[source,r]
-mynchar <- nchar(as.character(myDF$description))
-====
-
-[NOTE]
-====
-Remember to consider _empty_ values and/or `NA` values.
-
-====
-
-=== Question 3 (1.5 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. Using the `table` function, and the new column `mileage_category` that you created in Question 2, find the number of cars in each of the different mileage categories.
-.. Using the `table` function, and the new column `has_VIN` that you created in Question 2, identify how many vehicles have a VIN and how many do not have a VIN.
-.. Using the `table` function, and the new column `description_length` that you created in Question 2, identify how many vehicles are in each of the categories of description length.
-
-
-=== Question 4 (1.5 pts)
-
-++++
-
-++++
-
-**Preparing for Mapping**
-//[arabic]
-[upperalpha]
-.. Extract all of the data for Texas into a data.frame called `myTexasDF`
-.. Identify the most popular state from myDF, and extract all of the data from that state into a data.frame called `popularStateDF`
-.. Create a third data.frame called `myFavoriteDF` with the data from a state of your choice
-
-
-=== Question 5 (2 pts)
-
-++++
-
-++++
-
-**Mapping**
-[upperalpha]
-.. Using the R package `leaflet`, make 3 maps of the USA, namely, one map for the data in each of the `data.frames` from question 4.
-
-
-=== Submitting your Work
-Well done, you've finished Project 4! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.
-
-Project 4 Assignment Checklist
-====
-- Code used to solve questions 1 to 5
-- All of your code and comments, and Output from running the code in a Jupyter Lab file:
- * `firstname-lastname-project04.ipynb`.
-- All of your code and comments in an R File:
- * `firstname-lastname-project04.R`.
-- Submit files through Gradescope
-====
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
-
-You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project05.adoc
deleted file mode 100644
index 0a5a6552b..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project05.adoc
+++ /dev/null
@@ -1,167 +0,0 @@
-= TDM 10100: Project 5 -- Fall 2023
-
-**Motivation:** `R` differs from other programming languages in that `R` works great with vectorized functions and the _apply_ suite of functions (instead of using loops).
-
-[NOTE]
-====
-The apply family of functions provides an alternative to loops. You can use *`apply()`* and its variants (i.e., `mapply()`, `sapply()`, `lapply()`, `vapply()`, `rapply()`, `tapply()`, etc.) to manipulate pieces of data from data.frames, lists, arrays, and matrices in a repetitive way.
-====
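-
-For instance, here is a small toy example (with made-up values) showing how `lapply` and `sapply` apply a function to each element of a list:
-[source,r]
-----
-scores <- list(a = c(1, 2, 3), b = c(10, 20))
-lapply(scores, mean)   # returns a list: $a is 2, $b is 15
-sapply(scores, mean)   # simplifies the result to a named vector: a = 2, b = 15
-----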
-
-**Context:** We will focus in this project on efficient ways of processing data in `R`.
-
-**Scope:** tapply function
-
-.Learning Objectives
-****
-- Demonstrate the ability to use the `tapply` function.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset in Anvil:
-
-`/anvil/projects/tdm/data/election/escaped2020sample.txt`
-
-[NOTE]
-====
-A txt and csv file both store information in plain text. Data in *csv* files are almost always separated by commas. In *txt* files, the fields can be separated by commas, semicolons, pipe symbols, tabs, or other separators.
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-To read in a txt file in which the data fields are separated by pipe symbols, add `sep="|"` (see the code below):
-[source,r]
-----
- myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")
-----
-
-You might want to use 3 cores in this project when you set up your Jupyter Lab session.
-====
-=== `Data Understanding`
-
-The file uses '|' (instead of commas) to separate the data fields. The reason is that one column of data contains full names, which sometimes include commas.
-
-[source,r]
-head(myDF)
-
-When looking at the head of the data frame, notice that the entries in the `TRANSACTION_DT` column have the month, day, and year all crammed together without any slashes between them.
-
-=== `lubridate`
-
-The `lubridate` package can be used to put a column into a date format. In general, data that contains information about dates can sometimes be hard to put into a date format, but the `lubridate` package makes this easier.
-
-[source,r]
-----
-library(lubridate, warn.conflicts = FALSE)
-myDF$newdates <- mdy(myDF$TRANSACTION_DT)
-----
-A new column `newdates` is created, with the same data as the `TRANSACTION_DT` column but now stored in `date` format.
-
-Feel free to check out https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf[the official cheatsheet] to learn more about the `lubridate` package.
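-
-For instance, here is what `mdy` does to a single made-up value in that crammed-together month-day-year format:
-[source,r]
-----
-library(lubridate, warn.conflicts = FALSE)
-mdy("10242020")   # parsed as the Date "2020-10-24"
-----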
-
-
-=== `tapply`
-
-*tapply()* helps us apply functions (for instance: mean, median, minimum, maximum, sum, etc...) to data, one group at a time. The *tapply()* function is most helpful when we need to break data into groups, applying a function to each of the groups of data.
-
-The `tapply` function takes three inputs:
-
-Some data to work on; a way to break the data into groups; and a function to apply to each group of data.
-
-[source, r]
-tapply(myDF$TRANSACTION_AMT, myDF$newdates, sum)
-
-* The `tapply` function applies `sum` to the `myDF$TRANSACTION_AMT` data, grouped according to `myDF$newdates`
-* Three inputs for tapply
-** `myDF$TRANSACTION_AMT`: the data vector to work on
-** `myDF$newdates`: the way to break the data into groups
-** `sum`: the function to apply to each group of data
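-
-For instance, here is a small toy example (with made-up values) of how `tapply` sums one vector according to the groups in another:
-[source,r]
-----
-amounts <- c(100, 250, 50, 300, 75)
-groups  <- c("A", "B", "A", "B", "A")
-tapply(amounts, groups, sum)
-# A: 225, B: 550
-----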
-
-== Questions
-
-
-=== Question 1 (1.5 pts)
-
-++++
-
-++++
-
-++++
-
-++++
-
-[upperalpha]
-.. Use the `year` function (from the `lubridate` library) on the column `newdates`, to create a new column named `TRANSACTION_YR`.
-.. Using `tapply`, add the values in the `TRANSACTION_AMT` column, according to the values in the `TRANSACTION_YR` column.
-.. Plot the years on the x-axis and the total amount of the transactions by year on the y-axis.
-
-=== Question 2 (1.5 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. From Question 1, you may notice that the majority of the data collected is found in the years 2019-2020. Please create a new dataframe that only contains data for the dates in the range 01/01/2020-12/31/2020.
-.. Using `tapply`, get the sum of the money in the `TRANSACTION_AMT` column, grouped according to the months January through December (in 2020 only).
-.. Plot the months on the x-axis and the total amount of the transactions (for each month) on the y-axis.
-
-=== Question 3 (1.5 pts)
-
-++++
-
-++++
-
-Let's go back to using the full set of data across all of the years (from Question 1). We can continue to experiment with the `tapply` function.
-
-[upperalpha]
-.. Please find the donor who gave the most money (altogether) in the whole data set.
-.. Find the total amount of money given (altogether) in each state. Then sort the states, according to the total amount of money given altogether. In which 5 states was the most money given?
-.. What are the ten zipcodes in which the most money is donated (altogether)?
-
-=== Question 4 (2 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. Using a `barplot` or `dotchart`, plot the total amount of money given in each of the top five states.
-.. Using a `barplot` or `dotchart`, plot the total amount of money given in each of the top ten zipcodes.
-
-=== Question 5 (1.5 pts)
-
-++++
-
-++++
-
-[upperalpha]
-.. Analyze something that you find interesting about the election data, make a plot to demonstrate your insight, and then explain your finding with a few sentences of explanation.
-
-Project 05 Assignment Checklist
-====
-* Jupyter Lab notebook with your code and comments for the assignment
- ** `firstname-lastname-project05.ipynb`.
-* R code and comments for the assignment
- ** `firstname-lastname-project05.R`.
-
-* Submit files through Gradescope
-====
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project06.adoc
deleted file mode 100644
index f3d7613a9..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project06.adoc
+++ /dev/null
@@ -1,143 +0,0 @@
-= TDM 10100: Project 6 -- Fall 2023
-Tapply, Tapply, Tapply
-
-**Motivation:** We want to have fun and get used to the function `tapply`
-
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/olympics/athlete_events.csv`
-- `/anvil/projects/tdm/data/death_records/DeathRecords.csv`
-
-== Questions
-
-=== Question 1 (1.5 pts)
-
-++++
-
-++++
-
-++++
-
-++++
-
-(We do not need the tapply function for Question 1)
-
-For this question, please read the dataset
-
-`/anvil/projects/tdm/data/olympics/athlete_events.csv`
-
-into a data frame called `myDF` as follows:
-
-[source, r]
-
-myDF <- read.csv("/anvil/projects/tdm/data/olympics/athlete_events.csv", stringsAsFactors=TRUE)
-
-[loweralpha]
-.. Use the `table` function to list all Games with occurrences in this data frame
-.. Use the `table` function to list all countries participating in the Olympics during the year 1980. (The output should exclude all countries that did not have any athletes in 1980.)
-.. Use the `subset` function to create a new data frame containing data related to athletes that attended the Olympics more than one time.
-
-(Use the original data frame `myDF` as a starting point for each of these three questions. Problems 1a and 1b and 1c are independent of each other. For instance, when you solve question 1c, do not restrict yourself to the year 1980.)
-
-[TIP]
-====
-For question 1c, use `duplicated` to identify duplicated elements, for example:
-
-[source, r]
-vec <- c(3, 2, 6, 5, 1, 1, 1, 6, 5, 6, 4, 3)
-
-[source, r]
-duplicated(vec)
-
-which gives the output:
-
-----
-FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
-----
-
-====
-
-
-
-=== Question 2 (1.5 pts)
-
-Use the `tapply` command to solve each of these questions:
-
-[loweralpha]
-.. What is the average age of the participants from each country?
-.. What is the maximum height by sport? For your output on this question, please sort the maximum heights in decreasing order, and display the first 5 values.
-
-
-=== Question 3 (1 pt)
-
-++++
-
-++++
-
-For this question, save the data from the data set
-
-`/anvil/projects/tdm/data/death_records/DeathRecords.csv`
-
-into a new data frame called `myDF` as follows:
-
-[source, r]
-myDF <- read.csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv", stringsAsFactors = TRUE)
-
-It might be helpful to get an overview of the structure of the data frame, by using the `str()` function:
-
-[source, r]
-str(myDF)
-
-[loweralpha]
-.. How many observations (i.e., rows) are given in this dataframe?
-.. Change the column `MonthOfDeath` from numbers to months
-.. How many people died (altogether) during each month? For instance, group together all of the deaths in January, all of the deaths in February, etc., so that you can display the total numbers from January to December in a total of 12 output values.
-
-[TIP]
-====
-You may factorize the month names with a specified level order:
-[source, r]
-month_order <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
-myDF$MonthOfDeath <- factor(myDF$MonthOfDeath)
-levels(myDF$MonthOfDeath) <- month_order
-====
-
-=== Question 4 (2 pts)
-
-++++
-
-++++
-
-[loweralpha]
-.. For each race, what is the average age at the time of death? Use the `race` column, which has integer values, and sort your outputs into descending order.
-.. Now considering only data for females: for each race, what is the average age at the time of death? Now considering only data for males, we can ask the same question: for each race, what is the average age at the time of death?
-
-If you want to see the list of race values from the CDC for this data, you can look at page 15 of this pdf file:
-
-https://www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf
-
-If you want to (this is optional!) you can use the method we used in question 3B to convert integer values into the string values that describe each race. This is not required but you are welcome to do this, if you want to.
-
-=== Question 5 (2 pts)
-
-[loweralpha]
-.. Using the data set about the Olympic athletes, create a graph or plot that you find interesting. Write 1-2 sentences about something you found interesting about the data set; explain what you noticed in the dataset.
-.. Using the data set about the death records, create a graph or plot that you find interesting. Write 1-2 sentences about something you found interesting about the data set; explain what you noticed in the dataset.
-
-Project 06 Assignment Checklist
-====
-* Jupyter Lab notebook with your code and comments for the assignment
- ** `firstname-lastname-project06.ipynb`.
-* R code and comments for the assignment
- ** `firstname-lastname-project06.R`.
-
-* Submit files through Gradescope
-====
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project07.adoc
deleted file mode 100644
index 09274512e..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project07.adoc
+++ /dev/null
@@ -1,193 +0,0 @@
-= TDM 10100: Project 7 -- 2023
-
-**Motivation:** A couple of bread-and-butter functions that are a part of base R are `subset` and `merge`. `subset` provides a more natural way to filter and select data from a data.frame. `merge` brings the principles of combining data that SQL uses to R.
-
-**Context:** We've been getting comfortable working with data within the R environment. Now we are going to expand our tool set with these useful functions, all the while gaining experience and practice wrangling data!
-
-**Scope:** r, subset, merge, tapply
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Demonstrate how to use tapply to solve data-driven problems.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/icecream/combined/products.csv`
-- `/anvil/projects/tdm/data/icecream/combined/reviews.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/titles.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/episodes.csv`
-- `/anvil/projects/tdm/data/movies_and_tv/ratings.csv`
-
-== Questions
-
-[IMPORTANT]
-====
-Please select 3 cores when launching JupyterLab for this project.
-====
-
-Data can come in a lot of different formats and from a lot of different locations. It is common to have several files that need to be combined together, before analysis is performed. The `merge` function is helpful for this purpose. The way that we merge files is different in each language and data science tool. With R, there is a built-in `merge` function that makes things easy! (Of course students in TDM 10100 have not yet learned about SQL databases, but many of you will learn SQL databases someday too. The `merge` function is very similar to the ability to `merge` tables in SQL databases.)
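-
-For instance, here is a small toy example (with made-up data frames) of how `merge` combines two data frames that share a common column:
-[source,r]
-----
-left  <- data.frame(key = c(1, 2, 3), fruit = c("apple", "pear", "plum"))
-right <- data.frame(key = c(2, 3, 4), price = c(0.50, 0.75, 1.25))
-merge(left, right, by = "key")
-# keeps only the keys that appear in both data frames (2 and 3)
-----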
-
-++++
-
-++++
-
-[NOTE]
-====
-Read the data in using the following code. We used `read.csv` for this purpose in the past. The `fread` function is a _much faster_ and more efficient way to read in data.
-
-[source,r]
-----
-library(data.table)
-
-products <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
-reviews <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv")
-titles <- fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv")
-episodes <- fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv")
-ratings <- fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv")
-====
-
-[WARNING]
-====
-Please remember to run the `library(data.table)` line, before you use the `fread` function. Otherwise, you will get an error in a pink box in JupyterLab like this:
-
-Error in fread: could not find function "fread"
-====
-
-=== Question 1 (1 pt)
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-We will use the `products` data.frame for this question.
-
-[loweralpha]
-.. What are all the different ingredients in the first record of `products`?
-.. Consider the `rating` column and the `ingredients` column. Consider only the products in which "GUAR GUM" is one of the ingredients. (You will need to use either `grep` or `grepl` or something similar, to find these products. Hint: You should have 85 such products.) List the ratings for these 85 products, in decreasing order.
-
-In other words, please find the distribution of ratings for the ice cream products whose ingredients include "GUAR GUM", and display the result in descending order.
-
-
-=== Question 2 (1 pt)
-
-++++
-
-++++
-
-We will use the `products` and `reviews` data.frames for this question.
-
-[loweralpha]
-.. Use the `brand` and `key` columns from both the `products` data.frame and `reviews` data.frame to `merge` the two data frames. This will give a new data.frame that contains the product details and their associated reviews.
-
-[TIP]
-====
-If you do not specify the `brand` and `key` columns for the `merge`, then you will get an error, because the `ingredients` column contains characters in the `products` data frame but contains numeric values in the `reviews` data frame.
-====
-
-
-[TIP]
-====
-* The `merge` function in `R` allows two data frames to be combined by common columns. This function allows the user to combine data similar to the way `SQL` would using `JOIN`s. https://www.codeproject.com/articles/33052/visual-representation-of-sql-joins[Visual representation of SQL Joins]
-* This is also a really great https://www.datasciencemadesimple.com/join-in-r-merge-in-r/[explanation of merge in `R`].
-====
-
-=== Question 3 (3 pts)
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-We will use the `episodes`, `titles` and `ratings` data.frames for questions 3 through Question 5
-
-[loweralpha]
-.. Use `merge` (a few times) to create a new data.frame that contains at least the following four columns for **only** the episodes of the show called "Stranger Things". The show itself called "Stranger Things" has a `title_id` of tt4574334. You can find this on IMDB here: https://www.imdb.com/title/tt4574334/ Each episode of Stranger Things has its own `title_id` that contains the information for the specific episode as well. For your output: Show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things.
-
-- The `primary_title` of the **show itself** -- call it `show_title`.
-- The `primary_title` of the **episode** -- call it `episode_title`.
-- The `rating` of the **show itself** -- call it `show_rating`.
-- The `rating` of the **episode** -- call it `episode_rating`.
-
-[TIP]
-====
-Start by getting a subset of the `episodes` table that contains only the information for the show Stranger Things. That way, we aren't working with as much data.
-====
-
-Make sure to show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things!
-
-[NOTE]
-====
-In the videos, I did not rename the columns. You might want to rename them, because it might help you, but you do not need to rename them. It's up to you. I'm trying to be a little flexible and to provide guidance without being too strict either.
-====
-
-=== Question 4 (1 pt)
-
-++++
-
-++++
-
-For question 4, use the data frame that you built in Question 3.
-
-[loweralpha]
-.. Use regular old indexing to find all episodes of "Stranger Things" with an `episode_rating` less than 8.5 and `season_number` of exactly 3.
-.. Repeat the process, but this time use the `subset` function instead.
-
-Make sure that the dimensions of the data frames that you get in question 4a and 4b are the same sizes!
-
-=== Question 5 (2 pts)
-
-++++
-
-++++
-
-For question 5, use the data frame that you built in Question 3.
-
-The `subset` function allows you to index data.frames in a less verbose manner. Read https://the-examples-book.com/programming-languages/R/subset[this].
-
-While it may appear to be a clean way to subset data, I'd suggest avoiding it in favor of explicit long-form indexing. Read http://adv-r.had.co.nz/Computing-on-the-language.html[this fantastic article by Dr. Hadley Wickham on non-standard evaluation]. Take, for example, the following (somewhat contrived) example using the data frame we built in Question 3.
-
-Note: You do not need to write much for your answer. It is OK if you try the example below, and you see that it fails (and it will fail for sure!), and then you say something like, "I will try hard to not use variable names that overlap with other variable names". Or something like that! We simply want to ensure that students are choosing to use good variable names.
-
-[source,r]
-----
-season_number <- 3
-subset(StrangerThingsBigMergedDF, (season_number == season_number) & (rating.y < 8.5))
-----
-[loweralpha]
-.. Read that provided article and do your best to explain _why_ `subset` gets a different result than our example that uses regular indexing.
-
-
-Project 07 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project07.ipynb`.
-* R code and comments for the assignment
- ** `firstname-lastname-project07.R`.
-
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project08.adoc
deleted file mode 100644
index c7104c9eb..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project08.adoc
+++ /dev/null
@@ -1,174 +0,0 @@
-= TDM 10100: Project 8 -- 2023
-
-**Motivation:** Functions are an important part of writing efficient code. +
-Functions allow us to repeat and reuse code. If you find yourself using a set of coding steps over and over, a function may be a good way to reduce your lines of code!
-
-**Context:** We've been learning about and using functions these last few weeks. +
-To learn how to write your own functions we need to learn some of the terminology and components.
-
-**Scope:** r, functions
-
-.Learning Objectives
-****
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Comprehend what a function is, and the components of a function in R.
-****
-
-== Dataset(s)
-
-We will use the same dataset(s) as last week:
-
-- `/anvil/projects/tdm/data/icecream/combined/products.csv`
-- `/anvil/projects/tdm/data/icecream/combined/reviews.csv`
-
-[IMPORTANT]
-====
-Please choose 3 cores when launching the JupyterLab for this project.
-====
-
-[NOTE]
-====
-`fread` is a fast and efficient way to read in data.
-
-[source,r]
-----
-library(data.table)
-
-products <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
-reviews <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv")
-----
-====
-[WARNING]
-====
-Please remember to run the `library(data.table)` line, before you use the `fread` function. Otherwise, you will get an error in a pink box in JupyterLab like this:
-
-Error in fread: could not find function "fread"
-====
-
-We will see how to write our own function, so that we can make a repetitive operation easier, by turning it into a single command. +
-
-We need to take care to name the function something concise but meaningful, so that other users can understand what the function does. +
-
-Function parameters can also be called formal arguments.
-
-[NOTE]
-====
-A function contains multiple interrelated statements. We can "call" the function, which means that we run all of the statements from the function. +
-
-Functions can be built-in or can be created by the user (user-defined). +
-
-.Some examples of built in functions are:
-
-* `min()`, `max()`, `mean()`, `median()`
-* `print()`
-* `head()`
-
-
-Syntax of a function:
-[source, r]
-----
-what_you_name_the_function <- function(parameters) {
-  # statement(s) that are executed when the function runs
-  # the last expression of the function is the returned value
-}
-----
-====
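-
-For instance, here is a small user-defined function (with a made-up name and toy input) that follows this syntax:
-[source,r]
-----
-# returns the range (largest value minus smallest value) of a numeric vector
-my_range <- function(x) {
-  largest  <- max(x, na.rm = TRUE)
-  smallest <- min(x, na.rm = TRUE)
-  largest - smallest   # the last expression is the returned value
-}
-
-my_range(c(4, 10, 2, 8))   # returns 8
-----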
-
-== Questions
-
-=== Question 1 (2 pts)
-
-++++
-
-++++
-
-++++
-
-++++
-
-
-To gain better insights into our data, let's make two simple plots. The following are two examples. You can create your own plots.
-
-[loweralpha]
-.. In project 07, you found the different ingredients for the first record in the `products` data frame. We may get all of the ingredients from the `products` data frame, and find the top 10 most frequently used ingredients. Then we can create a bar chart for the distribution of the number of times that each ingredient appears.
-.. A line plot to visualize the distribution of the reviews of the products.
-.. What information are you gaining from these graphs?
-[TIP]
-====
-The `table` function can be useful to get the distribution of the number of times that each ingredient appears.
-
-This is a good website for bar plot examples: https://www.statmethods.net/graphs/bar.html
-
-This is a good website for line plot examples: http://www.sthda.com/english/wiki/line-plots-r-base-graphs
-====
-
-Making a `dotchart` for Question 1 is helpful and insightful, as demonstrated in the video. BUT we also want you to see how to make a bar plot and a line plot. Do not worry about the names of the ingredients too much. If only a few names of ingredients appear on the x-axis for Question 1, that is OK with us. We just want to show the distribution (in other words, the numbers) of times that items appear. We are less concerned about the item names themselves.
-
-
-=== Question 2 (1 pt)
-
-For practice, now that you have a basic understanding of how to make a function, we will use that knowledge, applied to our dataset.
-
-Here are the pieces of a function that we will use on this dataset, involving the products, the reviews, and the products' ratings. Put the pieces in the correct order: +
-[source,r]
-* merge_results <- merge(products_df, reviews_df, by="key")
-* }
-* function(products_df, reviews_df, myrating)
-* return(products_reviews_results)
-* {
-* products_reviews_results <- merge_results[merge_results$rating >= myrating, ]
-* products_reviews_by_rating <-
-
-
-=== Question 3 (1 pt)
-
-
-Take the above function and add comments explaining what the function does at each step.
-
-=== Question 4 (2 pts)
-
-[source,r]
-----
-my_selection <- products_reviews_by_rating(products, reviews, 4.5)
-----
-
-Use the code above to answer the following question. We want you to use the data frame `my_selection` when solving Question 4. (Do not use the full `products` data frame for Question 4.)
-
-[loweralpha]
-.. How many products are there (altogether) that have rating at least 4.5? (This is supposed to be simple: You can just find the number of rows of the data frame `my_selection`.)
-
-
-[TIP]
-====
-The function merged two data sets, `products` and `reviews`. Both of them have an `ingredients` column, so we need to use the `ingredients` column from `products` by referring to `ingredients.x`.
-====
-
-=== Question 5 (2 pts)
-
-For Question 5, go back to the full `products` data frame. (In other words, do not limit yourself to `my_selection` any more.) When you are constructing your function in part a, it should be helpful to review the videos from Question 1.
-
-[loweralpha]
-.. Now create a function that takes 1 ingredient as the input, and finds the number of products that contain that ingredient.
-.. Use your function to determine how many products contain SALT as an ingredient.
-
-(Note: If you test the function with "GUAR GUM", for instance, you will see that there are 85 products with "GUAR GUM" as an ingredient, as we learned in the previous project.)
-
-
-Project 08 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project08.ipynb`
-* R code and comments for the assignment
- ** `firstname-lastname-project08.R`.
-
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project09.adoc
deleted file mode 100644
index 7df6484dc..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project09.adoc
+++ /dev/null
@@ -1,169 +0,0 @@
-= TDM 10100: Project 9 -- 2023
-:page-mathjax: true
-
-Benford's Law
-
-**Motivation:**
-https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] has many applications, including its well known use in fraud detection. It also helps detect anomalies in naturally occurring datasets.
-[NOTE]
-====
-* You may get more information about Benford's law from the following link
-https://www.kdnuggets.com/2019/08/benfords-law-data-science.html["What is Benford's Law and Why is it Important for Data Science"]
-====
-
-**Scope:** `R` and functions
-
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-* `/anvil/projects/tdm/data/restaurant/orders.csv`
-
-[NOTE]
-====
-A txt and csv file both store information in plain text. Data in csv files are almost always separated by commas. In txt files, the fields can be separated with commas, semicolons, pipe symbols, or tabs.
-
-[source,r]
-----
-myDF <- read.csv("/anvil/projects/tdm/data/restaurant/orders.csv")
-----
-====
-
-== Questions
-
-https://www.statisticshowto.com/benfords-law/[Benford's law] (also known as the first digit law) states that the leading digits in a collection of datasets will most likely be small. +
-It is basically a https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/probability-distribution/[probability distribution] that gives the likelihood of the first digit occurring, in a set of numbers.
-
-Another way to understand Benford's law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently.
-
-[NOTE]
-====
-A probability distribution helps define what the probability of an event happening is. It can be simple events like a coin toss, or it can be applied to complex events such as the outcome of drug treatments etc. +
-
-* Basic probability distributions which can be shown on a probability distribution table.
-* Binomial distributions, which have “Successes” and “Failures.”
-* Normal distributions, sometimes called a Bell Curve.
-
-Remember that the sum of all the probabilities in a distribution is always 100% or 1 as a decimal.
-
-This law is stated in terms of the *significand S(x)* of each number, which means the number written in a standard format. +
-
-To find the significand of a number, you must:
-
-* Find the first non-zero digit
-* Move the decimal point to the right of that digit
-* Ignore the sign
-
-For example, 9087 and -.9087 both have the *S(x)* value 9.087.
-
-Benford's law can also be applied to the second, third, and succeeding digits, and to the probability of certain combinations of digits. +
-
-Typically this law does not apply to data sets that have a minimum and maximum (restricted). This law does not apply to datasets if the numbers are assigned (i.e. social security numbers, phone numbers etc.) and are not naturally occurring numbers. +
-
-Larger datasets, and data that range over multiple orders of magnitude from low to high, work well with Benford's law.
-====
-
-++++
-
-++++
-
-Benford's law is given by the equation below.
-
-$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$
-
-$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)
-
-For example, the probability of the first digit being a 1 is
-
-$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = 0.301$
-
-The following is a function implementing Benford's law
-[source, r]
-benfords_law <- function(d) log10(1+1/d)
-
-To show Benford's law in a line plot:
-[source, r]
-----
-digits <- 1:9
-bf_val <- benfords_law(digits)
-plot(digits, bf_val, type = "l", xlab = "digits", ylab = "probabilities", main = "Benford's Law Line Plot")
-----
-
-
-=== Question 1 (1 pt)
-
-++++
-
-++++
-
-[loweralpha]
-
-.. Create a plot (could be a bar plot, line plot, scatter plot, etc., any type of plot is OK) to show Benford's Law for the probabilities of digits from 1 to 9.
-
-=== Question 2 (1 pt)
-
-++++
-
-++++
-
-.. Create a function called `first_digit` that takes an argument `number`, and extracts the first non-zero digit from the number
-
-=== Question 3 (2 pts)
-
-++++
-
-++++
-
-.. Read in the restaurant orders data `/anvil/projects/tdm/data/restaurant/orders.csv` into a dataset named `myDF`.
-
-.. Create a vector `fd_grand_total` by using `sapply` with your function `first_digit` from question 2 on the `grand_total` column in your `myDF` dataframe
-
-
-=== Question 4 (2 pts)
-
-++++
-
-++++
-
-++++
-
-++++
-
-.. Calculate the actual distribution of digits in `fd_grand_total`
-.. Plot the output actual distribution (again, could be a bar plot, line plot, dot plot, etc., anything is OK). Does it look like it follows Benford's law? Explain briefly.
-
-[TIP]
-====
-Use `table` to get the counts of each digit, then divide by the `length` of the vector `fd_grand_total`.
-====
-
-=== Question 5 (2 pts)
-
-++++
-
-++++
-
-.. Create a function that returns a new data frame `orders_by_dates` from `myDF`, by comparing the `delivery_date` column with two arguments, `start_date` and `end_date`. If the `delivery_date` falls in between, then the record should be included in the new data frame.
-.. Run the function for a certain period, and display some orders with the `head` function
-
-[TIP]
-`as.Date` will be useful for converting values into dates, so that dates can be compared.
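-
-For instance, here is a generic illustration (with made-up date strings; the actual format of the `delivery_date` column may differ) of comparing dates after converting with `as.Date`:
-[source,r]
-----
-start_date <- as.Date("2020-01-01")
-end_date   <- as.Date("2020-03-31")
-mydate     <- as.Date("2020-02-15")
-(mydate >= start_date) & (mydate <= end_date)   # TRUE
-----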
-
-
-Project 09 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project09.ipynb`.
-* R code and comments for the assignment
- ** `firstname-lastname-project09.R`.
-
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project10.adoc
deleted file mode 100644
index e64de3579..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project10.adoc
+++ /dev/null
@@ -1,102 +0,0 @@
-= TDM 10100: Project 10 -- 2023
-Creating functions and using `tapply` and `sapply`
-
-**Motivation:** As we have learned, functions are foundational to more complex programs and behaviors. +
-There is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming].
-
-**Context:**
-We will apply functions to entire vectors of data using `tapply` and `sapply`. We learned how to create functions, and now the next step is to use them on a series of data.
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The project will use the following dataset(s):
-
-* `/anvil/projects/tdm/data/restaurant/orders.csv`
-* `/anvil/projects/tdm/data/restaurant/vendors.csv`
-
-[NOTE]
-====
-The `read.csv()` function assumes, by default, that the fields are delimited by a comma `,` +
-You can use other delimiters by adding the `sep` argument, +
-e.g., `read.csv(..., sep=';')` +
-
-You can also load the `data.table` library and use the `fread` function.
-====
-
-
-== Questions
-
-=== Question 1 (2 pts)
-
-++++
-
-++++
-
-
-Please load the datasets into data frames named `orders` and `vendors`
-
-There are many websites that explain how to use `grep` and `grepl` (the `l` stands for `logical`) to search for patterns. See, for example: https://statisticsglobe.com/grep-grepl-r-function-example
-
-.. Use the `grepl` function and the `subset` function to make a new data frame from `vendors`, containing only the rows with "Fries" in the column called `vendor_tag_name`.
-
-.. Now use the `grep` function and row indexing, to make a data frame from `vendors` that (as before) contains only the rows with "Fries" in the column called `vendor_tag_name`.
-
-.. Verify that your data frames in questions 1a and 1b are the same size.
-
-=== Question 2 (2 pts)
-
-++++
-
-++++
-
-.. In the data frame `vendors`, there are two types of `delivery_charge` values: 0 (which represents free delivery) and 0.7 (which represents non-free delivery). Make a table that shows how many of each type of value there are in the `delivery_charge` column.
-.. Please use the `prop.table` function to convert these counts into percentages.
-
-=== Question 3 (2 pts)
-
-++++
-
-++++
-
-.. Consider only the vendors with `vendor_category_id == 2`. Among these vendors, find the percentages of the `delivery_charge` column that are 0 (free delivery) and 0.7 (non-free delivery).
-.. Now consider only the vendors with `vendor_category_id == 3`, and again find the percentages of the `delivery_charge` column that are 0 (free delivery) and 0.7 (non-free delivery).
-
-=== Question 4 (1 pt)
-
-++++
-
-++++
-
-.. Solve questions 3a and 3b again, but this time, solve these two questions with one application of the `tapply` command, which provides the answers to both questions. (It is fine to give only the counts here, in question 4a, and convert the counts to percentages in question 4b.)
-
-.. Now (instead) use a user-defined function inside the `tapply` to convert your answer from counts into percentages.
-
-=== Question 5 (1 pt)
-
-++++
-
-++++
-
-.. Starting with your solution to question 4a, now use the `sapply` command to convert your answer from counts into percentages. Your solution should agree with the percentages that you found in question 4b.
-
-
-
-Project 10 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project10.ipynb`
-* R code and comments for the assignment
- ** `firstname-lastname-project10.R`.
-
-* Submit files through Gradescope
-====
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project11.adoc
deleted file mode 100644
index 50433c406..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project11.adoc
+++ /dev/null
@@ -1,92 +0,0 @@
-= TDM 10100: Project 11 -- 2023
-
-**Motivation:** Selecting the right tools, understanding a problem and knowing what is available to support you takes practice. +
-So far this semester we have learned multiple tools to use in `R` to help solve a problem. This project will be an opportunity for you to choose the tools and decide how to solve the problem presented.
-
-We will also be looking at `Time Series` data. This is a way to study the change of one or more variables through time. Data visualizations help greatly in looking at Time Series data.
-
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The project will use the following dataset:
-
-* `/anvil/projects/tdm/data/restaurant/orders.csv`
-
-== Questions
-
-=== Question 1 (2 pts)
-
-++++
-
-++++
-
-
-Read in the dataset `/anvil/projects/tdm/data/restaurant/orders.csv` into a data.frame named `orders`
-
-[loweralpha]
-.. Convert the `created_at` column to month, day, year format
-.. How many unique years are in the data.frame?
-.. Create a line plot that shows the average number of orders placed per day of the week (e.g., Monday, Tuesday, ...).
-.. Write one to two sentences on what you notice in the graph
-
-=== Question 2 (2 pts)
-
-++++
-
-++++
-
-
-[loweralpha]
-.. Identify the top 5 vendors (vendor_id) with the highest number of orders over the years (based on `created_at` for time reference)
-.. For these top 5 vendors, determine the average grand_total amount for the orders they received each year
-.. Comment on any interesting patterns you observe, regarding the average total amount across these vendors, and how that changed over the years.
-
-[NOTE]
-====
-You can use either `tapply` OR the `aggregate` function to group or summarize data
-====
-
-=== Question 3 (2 pts)
-
-++++
-
-++++
-
-
-
-.. Using the `created_at` field, try to find out how many orders are placed after 5 pm, and how many orders are placed before 5 pm?
-.. Create a bar chart that compares the number of orders placed after 5 pm with the number of orders before 5 pm, for each day of the week
-
-[NOTE]
-====
-You can use the library `ggplot2` for this question.
-
-You may get more information about ggplot2 from here: https://ggplot2.tidyverse.org
-====
-
-=== Question 4 (2 pts)
-
-Looking at the data, is there something that you find interesting?
-Create 3 new graphs, and explain what you see, and why you chose each specific type of plot.
-
-
-Project 11 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project11.ipynb`
-* R code and comments for the assignment
- ** `firstname-lastname-project11.R`.
-
-* Submit files through Gradescope
-====
-
-
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project12.adoc
deleted file mode 100644
index 16afd55e3..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project12.adoc
+++ /dev/null
@@ -1,86 +0,0 @@
-= TDM 10100: Project 12 -- 2023
-
-**Motivation:**
-In the previous project we manipulated dates; in this project we are going to continue to work with dates.
-Working with dates in `R` can require more attention than working with other object classes. Packages such as `lubridate` will help simplify some of the common tasks related to date data. +
-
-Dates and times can be complicated. For instance, not every year has 365 days. Dates are difficult because they have to accommodate the Earth's rotation and orbit around the sun. We need to handle timezones, daylight savings, etc.
-It suffices to say that, when focusing on dates and date-times in R, the simpler the better.
-
-.Learning Objectives
-****
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- Utilize apply functions in order to solve a data-driven problem.
-- Gain proficiency using split, merge, and subset.
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Incorporate legends using legend().
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-- Work with dates in a variety of ways.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The project will use the following dataset:
-
-* `/anvil/projects/tdm/data/restaurant/orders.csv`
-
-== Questions
-
-Go ahead and use the `fread` function from the `data.table` library, to read in the dataset to a data frame called `orders`.
-
-=== Question 1 (2 pts)
-
-++++
-
-++++
-
-
-[loweralpha]
-. Use the `substr` function to get (only) the month-and-year of each date in the `created_at` column. How many times does each month-and-year pair occur? You may find more information about the `substr` function here: https://www.digitalocean.com/community/tutorials/substring-function-in-r#[R substring]
-. Now (instead) use the `month` function and the `year` function on the `created_at` column, and make sure that your results agree with the results from 1a.
-. Finally, use the `format` function to extract the month-and-year pairs from the `created_at` column, and make sure that your results (again!) agree with the results from 1a.
-
-
-=== Question 2 (2 pts)
-
-++++
-
-++++
-
-[loweralpha]
-. Which `customer_id` placed the largest number of orders altogether? (Each row of the data set represents exactly one order.)
-. For the `customer_id` that you found in question 2a, either use the `subset` function or use indexing to find the month-and-year pair in which that customer placed the most orders.
-
-=== Question 3 (2 pts)
-
-[loweralpha]
-. There are 5 types of payments in the `payment_mode` column. How many times are each of these 5 types of payments used in the data set?
-. If we focus on the `customer_id` found in question 2a, which type of payment does that customer prefer? How many times did that customer use each of the 5 types of payments?
-
-=== Question 4 (2 pts)
-
-[loweralpha]
-. Use the `subset` function to make a data frame called `ordersJan2020` that contains only the orders from January 2020.
-. Create a plot using the `ordersJan2020` data that shows the sum of the `grand_total` values for each of the 7 days of the week.
-
-
-
-Project 12 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project12.ipynb`
-* R code and comments for the assignment
- ** `firstname-lastname-project12.R`.
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project13.adoc
deleted file mode 100644
index ddca85d74..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project13.adoc
+++ /dev/null
@@ -1,84 +0,0 @@
-= TDM 10100: Project 13 -- 2023
-
-**Motivation:** This semester we took a deep dive into `R` and its packages. Let's take a second to pat ourselves on the back for surviving a long semester and review what we have learned!
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The project will use the following dataset:
-
-* `/anvil/projects/tdm/data/icecream/combined/products.csv`
-
-== Questions
-
-=== Questions 1 (2 pts)
-
-For question 1, read the dataset into a data.frame called `orders`
-
-[loweralpha]
-.. Create a plot that shows, for each `brand` of ice cream, the total number of ratings (in other words, the `sum` of the `rating_count` values for that brand). There are 4 brands, so your solution should have 4 values altogether.
-
-[TIP]
-====
-- It might be worthwhile to make a dotchart.
-
-====
-
-Before solving Question 2, please build a data frame called `bigDF` from these three files
-
-`/anvil/projects/tdm/data/icecream/bj/reviews.csv`
-
-`/anvil/projects/tdm/data/icecream/breyers/reviews.csv`
-
-`/anvil/projects/tdm/data/icecream/talenti/reviews.csv`
-
-using this code:
-
-[source,r]
-----
-library(data.table)
-
-mybrands <- c("bj", "breyers", "talenti")
-myfiles <- paste0("/anvil/projects/tdm/data/icecream/", mybrands, "/reviews.csv")
-bigDF <- do.call(rbind, lapply(myfiles, fread))
-----
-
-Use this data frame `bigDF` to answer Questions 2, 3, and 4:
-
-
-=== Question 2 (2 pts)
-
-[loweralpha]
-.. In which month-and-year pair were the most reviews given? (There is one review per line of this data frame `bigDF`.)
-.. Make a plot that shows, for each year, the average number of stars in that year.
-
-=== Question 3 (2 pts)
-
-[loweralpha]
-.. Which key has the lowest average number of stars?
-.. There is one entry in which the text review has more than 2500 characters! Print the text of this review.
-
-=== Question 4 (2 pts)
-
-[loweralpha]
-.. Consider all of the authors of the reviews. Which author wrote the most reviews altogether? (Note: there are many blank authors, and there are a lot of Anonymous authors, but please ignore blank authors and Anonymous authors in this question.)
-.. Considering the 43 reviews written by the author that you found in question 4a, this author is usually happy and gives high ratings. BUT this author gave one review that only had 1 star. Print the text of that 1 star review from the author you found in question 4a.
-
-
-
-
-Project 13 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project13.ipynb`
-* R code and comments for the assignment
- ** `firstname-lastname-project13.R`.
-
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project14.adoc
deleted file mode 100644
index d5317a6f8..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project14.adoc
+++ /dev/null
@@ -1,53 +0,0 @@
-= TDM 10100: Project 14 -- Fall 2023
-
-**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey-based. You will answer a few questions, mostly by revisiting the projects you have completed.
-
-**Context:** We are on the last project, where we will revisit our previous work to consolidate our learning and insights. This reflection also helps us to set our expectations for the upcoming semester.
-
-**Scope:** R, Jupyter Lab, Anvil
-
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-
-=== Question 1 (1 pt)
-
-.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset.
-
-=== Question 2 (1 pt)
-
-.. Reflecting on your experience working with different commands, functions, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, or package.
-
-
-=== Question 3 (1 pt)
-
-.. Reflecting on data visualization questions that you have done, which one do you consider most appealing? Please provide an example from one question that you completed. You may refer to the question, and screenshot your graph.
-
-=== Question 4 (2 pts)
-
-.. While working on the projects, including statistics and testing, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester.
-
-=== Question 5 (1 pt)
-
-.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points.
-
-=== Question 6 (2 pts)
-
-.. Please identify 3 skills or topics related to the R language that you want to learn. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial.
-
-
-.Project 14 Assignment Checklist
-====
-* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions.
- ** `firstname-lastname-project14.ipynb`
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-projects.adoc
deleted file mode 100644
index 7fd078360..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-projects.adoc
+++ /dev/null
@@ -1,45 +0,0 @@
-= TDM 10100
-
-xref:fall2023/logistics/office_hours_101.adoc[[.custom_button]#TDM 101 Office Hours#]
-xref:fall2023/logistics/101_TAs.adoc[[.custom_button]#TDM 101 TAs#]
-xref:fall2023/logistics/syllabus.adoc[[.custom_button]#Syllabus#]
-
-== Project links
-
-[NOTE]
-====
-Only the best 10 of 14 projects will count towards your grade.
-====
-
-[CAUTION]
-====
-Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses.
-====
-
-[%header,format=csv,stripes=even,%autowidth.stretch]
-|===
-include::ROOT:example$10100-2023-projects.csv[]
-|===
-
-[WARNING]
-====
-Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete.
-
-**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information.
-
-Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza.
-====
-
-== Piazza
-
-=== Sign up
-
-https://piazza.com/purdue/fall2023/tdm10100[https://piazza.com/purdue/fall2023/tdm10100]
-
-=== Link
-
-https://piazza.com/purdue/fall2023/tdm10100/home[https://piazza.com/purdue/fall2023/tdm10100/home]
-
-== Syllabus
-
-See xref:fall2023/logistics/syllabus.adoc[here].
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project01.adoc
deleted file mode 100644
index d475e7357..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project01.adoc
+++ /dev/null
@@ -1,388 +0,0 @@
-= TDM 20100: Project 1 -- 2023
-
-**Motivation:** It’s been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like `sort`, `awk`, `grep`, and `sed`. While Python and R are extremely powerful tools that can solve many problems — they aren’t always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using `sqlite3` and `MySQL/MariaDB`.
-
-We will start slowly, however, by remembering how to work with Jupyter Lab. In this project we will become re-familiarized with our development environment, review some fundamentals, and prepare for the rest of the semester.
-
-**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester.
-
-**Scope:** Jupyter Lab, R, Python, Anvil, markdown
-
-.Learning Objectives
-****
-- Read about and understand computational resources available to you.
-- Learn how to run R code in Jupyter Lab on Anvil.
-- Review R and Python.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/1991.csv`
-
-== Setting Up to Work
-
-++++
-
-++++
-
-
-This year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster.
-
-Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (including 2-factor authentication using Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward.
-
-[TIP]
-====
-If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup
-====
-
-Towards the middle of the top menu, click on the item labeled btn:[My Interactive Sessions]. (Depending on the size of your browser window, there might only be an icon; it is immediately to the right of the menu item for The Data Mine.) On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, near the bottom of your screen, click on btn:[Jupyter Notebook]. (Make sure that you choose the Jupyter Notebook from "The Data Mine" section.)
-
-If everything was successful, you should see a screen similar to the following.
-
-image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"]
-
-Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 1918 MB of memory.
-
-[NOTE]
-====
-As you can see in the screenshot above, each core is associated with 1918 MB of memory. If you know how much memory your project will need, you can use this value to choose how many cores you want. In this and most of the other projects in this class, 1-2 cores is generally enough.
-====
-
-[NOTE]
-====
-Please use 4 cores for this project. This is _almost always_ excessive, but for this project in question 3 you will be reading in a rather large dataset that will very likely crash your kernel without at least 3-4 cores.
-====
-
-We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine.
-
-After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on this button to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following.
-
-image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"]
-
-There are 2 primary options that you will need to know about.
-
-seminar::
-The `seminar` kernel runs Python code but also has the ability to run R code or SQL queries in the same environment.
-
-[TIP]
-====
-To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-seminar-r::
-The `seminar-r` kernel is intended for projects that **only** use R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell.
-
-For now, let's focus on the `seminar` kernel. Click on btn:[seminar], and a fresh notebook will be created for you.
-
-
-The first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`).
-
-Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`.
-
-There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain comments about your work).
-
-Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A.
-
-[TIP]
-====
-Make sure to read about and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-====
-
-== Questions
-
-=== Question 1 (1 pt)
-[upperalpha]
-.. How many cores and how much memory (in GB) does Anvil's sub-cluster A have? (0.5 pts)
-.. How many cores and how much memory (in GB) does your personal computer have? (0.5 pts)
-
-For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster].
-
-Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (often called RAM, or Random Access Memory). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for Anvil's "sub-cluster A".
-
-Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer.
-
-[TIP]
-====
-Information about the core and memory capacity of Anvil "sub-clusters" can be found https://www.rcac.purdue.edu/compute/anvil[here].
-
-Information about the core and memory capacity of your computer is typically found in the "About this PC" section of your computer's settings.
-====
-
-.Items to submit
-====
-- A sentence (in a markdown cell) explaining how many cores and how much memory is available to Anvil sub-cluster A.
-- A sentence (in a markdown cell) explaining how many cores and how much memory is available, in total, for your own computer.
-====
-
-=== Question 2 (1 pt)
-[upperalpha]
-.. Using Python, what is the name of the node on Anvil you are running on?
-.. Using Bash, what is the name of the node on Anvil you are running on?
-.. Using R, what is the name of the node on Anvil you are running on?
-
-Our next step will be to test out our connection to the Anvil Computing Cluster! Run the following code snippets in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on (in three different languages!). What is the name of the node on Anvil that you are running on?
-
-[source,python]
-----
-import socket
-print(socket.gethostname())
-----
-
-[source,r]
-----
-%%R
-
-system("hostname", intern=TRUE)
-----
-
-[source,bash]
-----
-%%bash
-
-hostname
-----
-
-[TIP]
-====
-To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu.
-====
-
-Check the results of each code snippet to ensure they all return the same hostname. Do they match? You may notice that `R` prints some extra "junk" output, while `bash` and `Python` do not. This is nothing to be concerned about as different languages can handle output differently, but it is good to take note of.
-
-.Items to submit
-====
-- Code used to solve this problem, along with the output of running that code.
-====
-
-=== Question 3 (1 pt)
-[upperalpha]
-.. Run each of the example code snippets below, and include them and their output in your submission to get credit for this question.
-
-++++
-
-++++
-
-
-[TIP]
-====
-Remember, in the upper right-hand corner of your notebook you will see the current kernel for the notebook, `seminar`. If you click on this name you will have the option to swap kernels out -- no need to do this now, but it is good to know!
-====
-
-In this course, we will be using Jupyter Lab with multiple languages. Often, we will center a project around a specific language and choose the kernel for that language appropriately, but occasionally we may need to run a language in a kernel other than the one it is primarily built for. The solution to this is line magic!
-
-Line magic tells our code interpreter that we are using a language other than the default for our kernel (i.e. The `seminar` kernel we are currently using is expecting Python code, but we can tell it to expect R code instead.)
-
-Line magic works by having the very first line in a code cell formatted like so:
-
-`%%language`
-
-Where `language` is the language we want to use. For example, if we wanted to run R code in our `seminar` kernel, we would use the following line magic:
-
-`%%R`
-
-Practice running the following examples, which include line magic where needed.
-
-python::
-[source,python]
-----
-import pandas as pd
-df = pd.read_csv('/anvil/projects/tdm/data/flights/subset/1991.csv')
-----
-
-[source,python]
-----
-df[df["Month"]==12].head() # get all flights in December
-----
-
-SQL::
-[source, ipython]
-----
-%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-----
-
-[source, sql]
-----
-%%sql
-
--- get all episodes called "Finale"
-SELECT *
-FROM episodes AS e
-INNER JOIN titles AS t
-ON t.title_id = e.episode_title_id
-WHERE t.primary_title = 'Finale'
-LIMIT 5;
-----
-
-bash::
-[source,bash]
-----
-%%bash
-
-names="John Doe;Bill Withers;Arthur Morgan;Mary Jane;Rick Ross;John Marston"
-echo $names | cut -d ';' -f 3
-echo $names | cut -d ';' -f 6
-----
-
-
-[NOTE]
-====
-In the above examples you will see lines such as `%%R` or `%%sql`. These are called "Line Magic". They allow you to run non-Python code in the `seminar` kernel. In order for line magic to work, it MUST be on the first line of the code cell it is being used in (before any comments or any code in that cell).
-
-In the future, you will likely stick to using the kernel that matches the project language, but we wanted you to have a demonstration about "line magic" in Project 1. Line magic is a handy trick to know!
-
-To learn more about how to run various types of code using the `seminar` kernel, see https://the-examples-book.com/projects/templates[our template page].
-====
-
-.Items to submit
-====
-- Code from the examples above, and the outputs produced by running that code.
-====
-
-=== Question 4 (1 pt)
-[upperalpha]
-.. How many code cells are there in the default template? (0.5 pts)
-.. How many markdown cells are there in the default template? (0.5 pts)
-
-As we mentioned in the `Setting Up to Work` section of this project, there are 2 main types of cells in a notebook: code cells (which contain code that you can run), and markdown cells (which contain markdown text that you can render into nicely formatted text). How many cells of each type are there in this template by default?
-
-.Items to submit
-====
-- The number of cells of each type in the default template, in a markdown cell.
-====
-
-=== Question 5 (1 pt)
-[upperalpha]
-.. Create an unordered list of at least 3 of your favorite interests. Italicize at least one of these. (0.5 pts)
-.. Create an ordered list of at least 3 of your favorite interests. Embolden at least one of these, and make at least one other item formatted like `code`. (0.5 pts)
-
-Markdown is well worth learning about. You may already be familiar with it, but more practice never hurts, and there are plenty of niche tricks you may not know!
-
-[TIP]
-====
-For those new to Markdown, please review this https://www.markdownguide.org/cheat-sheet/[cheat sheet]!
-====
-
-Create a Markdown cell in your notebook. For this question, we would like you to create two lists as follows.
-
-Firstly, create an _unordered_ list of at least 3 of your favorite interests (some examples could include sports, animals, music, etc.). Within this list, _italicize_ at least one item.
-
-Secondly, create an _ordered_ list that orders the items in your previous list, from most favorite to least favorite. In this list, **embolden** at least one item, and make at least one other item formatted like `code`.
-
-[TIP]
-====
-Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered.
-====
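-
-If you are unsure of the syntax, here is a generic sketch of both list types (placeholder items only -- your lists should reflect your own interests):
-
-[source,markdown]
-----
-- *an italicized interest*
-- another interest
-- a third interest
-
-1. **my favorite interest**
-2. `an interest formatted like code`
-3. a third interest
-----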
-
-.Items to submit
-====
-- Unordered list of 3+ items with at least one _italicized_ item.
-- Ordered list of 3+ items with at least one **emboldened** item and at least one `code` item.
-====
-
-=== Question 6 (1 pt)
-[upperalpha]
-.. Write your own LinkedIn "About" section using Markdown that includes a header, body text that you would be comfortable adding to your LinkedIn account, and at least one link using Markdown syntax.
-
-Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell, with the following features:
-
-- A header for this section (your choice of size) that says "About".
-- The body text of your personal "About" section that you would feel comfortable uploading to LinkedIn.
-- In the body text of your "About" section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax.
-
-[TIP]
-====
-A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`.
-====
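-
-Markdown's link syntax places the link text in square brackets, immediately followed by the URL in parentheses. Here is a minimal sketch of the required pieces (placeholder text and link -- write your own About section):
-
-[source,markdown]
-----
-## About
-
-I am a data science student at Purdue. I keep track of what I learn using resources like
-[The Examples Book](https://the-examples-book.com).
-----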
-
-.Items to submit
-====
-- A markdown cell containing your LinkedIn "About" entry, as described above.
-====
-
-=== Question 7 (2 pts)
-[upperalpha]
-.. Create a function in Python to print the median, mean, and standard deviation of the `DepDelay` column in our dataset, along with the shape of the `/anvil/projects/tdm/data/flights/subset/1991.csv` dataset overall. (1 pt)
-.. Create an R function to print the median, mean, and standard deviation of the `DepDelay` column in our dataset, along with the shape of the `/anvil/projects/tdm/data/flights/subset/1991.csv` dataset overall. (1 pt)
-
-This question may seem a bit difficult at first, but these are all concepts we covered in the 100 level of the class! Remember, your previous projects are still on Anvil (assuming you haven't deleted/overwritten them) and can be a great resource to look back on. You may also look back at the previous 100 level project instructions on The Examples Book.
-
-Using `pandas` in Python, create a function that takes a dataframe as input and prints the shape of the dataframe along with the mean, median, and standard deviation of the `DepDelay` column of that dataframe. Print your results formatted as follows:
-
-```
-MyDF Summary Statistics ---
-Shape: (rows, columns)
-Mean: 123.456
-Median: 123.456
-Standard Deviation: 123.456
----------------------------
-```
-
-Then, recreate your function but this time using R. Remember that you will need to use the `%%R` line magic at the top of your cell to tell the kernel that you are using R code. You should not need to import any libraries in order to do this.
-
-[TIP]
-====
-The `R` equivalent of `print()` is `cat()`.
-====
-
-[NOTE]
-====
-It is not important that your function output is formatted the exact same as ours. What is important, however, is that any printing that occurs in your code is neat and well formatted. If it is hard for the graders to read, you may lose points. Do your best and we will always work together to improve things.
-====
-
-Make sure your code is complete, and well-commented. Double check that both functions return the same values as a built-in sanity check for your code.
-
-.Items to submit
-====
-- Python Function to print median, mean, and standard deviation of the `DepDelay` column of our dataset, along with the shape of the dataset.
-- R Function to print median, mean, and standard deviation of the `DepDelay` column of our dataset, along with the shape of the dataset.
-====
-
-=== Submitting your Work
-
-++++
-
-++++
-
-
-Congratulations, you just finished your first assignment for this class! Now that we've written some code and added some markdown cells to explain what we did, we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project.
-
-We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading.
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
-
-You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.
-====
-
-A `.ipynb` file is generated by first running every cell in the notebook (which can be done quickly by pressing the "double play" button along the top of the page), and then clicking the "Download" button from menu:File[Download].
-
-In addition to the `.ipynb` file, an additional file should be included for each programming language used in the project, containing all of the code in that language. A full list of files required for the submission will be listed at the bottom of the project page.
-
-Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Do the same for each programming language, and ensure that all files in the submission requirements below are included. Once complete, submit all files as named and listed below to Gradescope.
-
-.Items to submit
-====
-- `firstname-lastname-project01.ipynb`.
-- `firstname-lastname-project01.R`.
-- `firstname-lastname-project01.py`.
-- `firstname-lastname-project01.sql`.
-- `firstname-lastname-project01.sh`.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
-
-Here is the Zoom recording of the 4:30 PM discussion with students from 21 August 2023:
-
-++++
-
-++++
diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project02.adoc
deleted file mode 100644
index 5c819f74c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project02.adoc
+++ /dev/null
@@ -1,395 +0,0 @@
-= TDM 20100: Project 2 -- 2023
-
-**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook.
-
-**Context:** At this point in time, our Jupyter Lab system, using https://ondemand.anvil.rcac.purdue.edu, is new to some of you, and maybe familiar to others. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab.
-
-**Scope:** bash, Jupyter Lab
-
-.Learning Objectives
-****
-- Distinguish differences in `/home`, `/anvil/scratch`, and `/anvil/projects/tdm`.
-- Navigating UNIX via a terminal: `ls`, `pwd`, `cd`, `.`, `..`, `~`, etc.
-- Analyzing files in a UNIX filesystem: `wc`, `du`, `cat`, `head`, `tail`, etc.
-- Creating and destroying files and folders in UNIX: `scp`, `rm`, `touch`, `cp`, `mv`, `mkdir`, `rmdir`, etc.
-- Use `man` to read and learn about UNIX utilities.
-- Run `bash` commands from within Jupyter Lab.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data`
-
-== Questions
-
-[IMPORTANT]
-====
-If you are not a `bash` user and you use an alternative shell like `zsh` or `tcsh`, you will want to switch to `bash` for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the `%%bash` magic will use `/bin/bash` rather than your default shell, so you will not need to do anything.
-====
-
-[NOTE]
-====
-While it is not _super_ common for us to push a lot of external reading at you (other than the occasional blog post or article), https://learning.oreilly.com/library/view/learning-the-unix/0596002610[this] is an excellent and _very_ short resource to get you started using a UNIX-like system. We strongly recommend reading chapters 1, 3, 4, 5, and 7. It is safe to skip chapters 2, 6, and 8.
-====
-
-=== Question 1 (1 pt)
-[upperalpha]
-.. A list (in a markdown cell) of at least 2 modifications you made to your environment.
-
-Let's ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from https://ondemand.anvil.rcac.purdue.edu.
-
-Open your settings by navigating to menu:Settings[Advanced Settings Editor].
-
-Explore the settings, and make at least 2 modifications to your environment, and list what you've changed.
-
-Here are some settings Kevin likes:
-
-- menu:Theme[Selected Theme > JupyterLab Dark]
-- menu:Document Manager[Autosave Interval > 30]
-- menu:File Browser[Show hidden files > true]
-- menu:Notebook[Line Wrap > on]
-- menu:Notebook[Show Line Numbers > true]
-- menu:Notebook[Shut down kernel > true]
-
-Dr. Ward does not like to customize his own environment, but he _does_ use the Emacs key bindings. Jackson _loves_ to customize his own environment, but he _despises_ Emacs bindings. Feel free to choose whatever is most comfortable to you.
-
-- menu:Settings[Text Editor Key Map > emacs]
-
-[IMPORTANT]
-====
-Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc.
-====
-
-.Items to submit
-====
-- List (using a markdown cell) of the modifications you made to your environment.
-====
-
-=== Question 2 (1 pt)
-[upperalpha]
-.. In a markdown cell, what is the absolute path of your home directory in Jupyter Lab?
-
-In the previous project's question 3, we used a tool called `cut` to parse through some data. This was an example of running bash code using the `seminar` kernel. Aside from using the `%%bash` magic from the previous project, there are 2 other straightforward ways to run bash code from within Jupyter Lab.
-
-The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example, using `ls` can be done like so:
-
-[source,ipython]
-----
-!ls
-
-import pandas as pd
-myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
-myDF.head()
-----
-
-[NOTE]
-====
-This does _not_ require you to have other, Python code in the cell. The following is perfectly valid.
-
-[source,ipython]
-----
-!ls
-!ls -la /anvil/projects/tdm/
-----
-
-With that being said, using this method, each line _must_ start with an exclamation point.
-====
-
-The second method is to open up a new terminal session. To do this, go to menu:File[New > Terminal]. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, `man`.
-
-[source,bash]
-----
-# man is short for manual, to quit, press "q"
-# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down.
-man man
-----
-
-Great! Now that you've learned 2 new ways to run `bash` code from within Jupyter Lab, please answer the following question:
-
-What is the _absolute path_ of the default directory of your `bash` shell? When we say "default directory" we mean the folder that you are "in" when you first run `bash` code in a Jupyter cell or when you first open a Terminal. This is also referred to as the home directory.
-
-**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd]
-
-.Items to submit
-====
-- `bash` code to print the full filepath of the default directory (home directory), and the output of running that code. Ex: Kevin's is: `/home/x-kamstut` and Dr Ward's is: `/home/x-mdw`.
-====
-
-=== Question 3 (1 pt)
-[upperalpha]
-.. `bash` to navigate to `/anvil/projects/tdm/data`
-.. `bash` to print the current working directory
-.. `bash` to list the files in the current working directory
-.. `bash` to list _all_ of the files in `/anvil/projects/tdm/data/movies_and_tv`, _including_ hidden files
-.. `bash` to return to your home directory
-.. `bash` to confirm that you are back in your home directory (print your current working directory)
-
-It is a critical skill to be able to navigate a UNIX-like operating system, and you will very likely need to use UNIX or Linux (or something similar) at some point in your career. For this question, write `bash` code to perform the following tasks in order. In your final submission, please ensure that all of your steps and their outputs are included.
-
-[WARNING]
-====
-For the sake of consistency, please run your `bash` code using the `%%bash` magic. This ensures that we are all using the correct shell (there are many shells), and that your work is displayed properly for your grader.
-====
-
-. Navigate to the directory containing the datasets used in this course: `/anvil/projects/tdm/data`.
-. Print the current working directory. Is the result what you expected?
-. Output the `$PWD` variable, using the `echo` command.
-. List the files within the current working directory (excluding subfiles).
-. Without navigating out of `/anvil/projects/tdm/data`, list _all_ of the files within the `movies_and_tv` directory, _including_ hidden files.
-. Return to your home directory.
-. Write a command to confirm that you are back in your home directory.
-
-[NOTE]
-====
-`/` is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/x-kamstut` is the _absolute path_ of Kevin's home directory.
-====
-
-**Relevant topics:**
-
-https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/ls[ls]
-
-.Items to submit
-====
-- `bash` to navigate to `/anvil/projects/tdm/data`, and print the current working directory
-- `bash` to list the primary files in the current working directory
-- `bash` to list _all_ of the files in `/anvil/projects/tdm/data/movies_and_tv`, _including_ hidden files
-- `bash` to return to your home directory and confirm you are there.
-====
-
-=== Question 4 (1 pt)
-[upperalpha]
-.. Write a single command to navigate to the modulefiles directory: `/anvil/projects/tdm/opt/lmod`, then confirm that you are in the correct directory using the `echo` command. (0.5 pts)
-.. Write a single command to navigate back to your home directory, using _relative_ paths, then confirm that you are in the correct directory using the `echo` command. (0.5 pts)
-
-When running the `ls` command (specifically the `ls` command that showed hidden files and folders), you may have noticed two oddities that appeared in the output: `.` and `..`. `.` represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the `/anvil/projects/tdm/data` directory, the `.` refers to the `/anvil/projects/tdm/data` directory. If you are running the following bash command, the `.` is redundant and refers to the `/anvil/projects/tdm/data/yelp` directory.
-
-[source,bash]
-----
-ls -la /anvil/projects/tdm/data/yelp/.
-----
-
-`..` represents the parent directory, relative to the rest of the path. For example, if you are in the `/anvil/projects/tdm/data` directory, the `..` refers to the parent directory, `/anvil/projects/tdm`.
-
-Any path that contains either `.` or `..` is called a _relative path_ (because it is _relative_ to the directory you are currently in). Any path that contains the entire path, starting from the root directory, `/`, is called an _absolute path_.
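-
-For example, assuming you start in `/anvil/projects/tdm/data/yelp`, relative paths let you move around without retyping the full path each time:
-
-[source,bash]
-----
-cd /anvil/projects/tdm/data/yelp   # start from an absolute path
-cd ..                              # now in /anvil/projects/tdm/data
-cd ../..                           # now in /anvil/projects
-pwd                                # print the current directory to confirm
-----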
-
-For this question, perform the following operations in order. Each operation should be a single command. In your final submission, please ensure that all of your steps and their outputs are included.
-
-. Write a single command to navigate to our modulefiles directory: `/anvil/projects/tdm/opt/lmod`.
-. Confirm that you are in the correct directory using the `echo` command.
-. Write a single command to navigate back to your home directory, however, rather than using `cd`, `cd ~`, or `cd $HOME` without the path argument, use `cd` and a _relative_ path.
-. Confirm that you are in the correct directory using the `echo` command.
-
-[NOTE]
-====
-If you don't fully understand the text above, _please_ take the time to understand it. It will be incredibly helpful to you, not only in this class, but in your career. You can also come to seminar or visit TA office hours to get assistance. We love to talk to students, and everyone benefits when we all collaborate.
-====
-
-**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/other-topics/special-symbols[special-symbols]
-
-.Items to submit
-====
-- Single command to navigate to the modulefiles directory.
-- Single command to navigate back to your home directory using _relative_ paths.
-- Commands confirming your navigation steps were successful.
-====
-
-
-=== Question 5 (1 pt)
-[upperalpha]
-.. Navigate to your scratch directory using environment variables.
-.. Run `tokei` on your home directory (use an environment variable).
-.. Output the first 5 lines and last 5 lines of `/anvil/datasets/training/anvil-101/batch-test/batch-test-README`. Make sure it is clear which lines are the first 5 and which are the last 5.
-.. Output the number of lines in `/anvil/datasets/training/anvil-101/batch-test/batch-test-README`
-.. Output the size, in bytes, of `/anvil/datasets/training/anvil-101/batch-test/batch-test-README`
-.. Output the location of the `tokei` program we used earlier.
-
-[NOTE]
-====
-`$SCRATCH` and `$USER` are referred to as _environment variables_. You can see what they are by typing `echo $SCRATCH` and `echo $USER`. `$SCRATCH` contains the absolute path to your scratch directory, and `$USER` contains the username of the current user. We will learn more about these in the rest of this question.
-====
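-
-For instance, you can print these variables to see what they hold (the exact values will differ for each user):
-
-[source,bash]
-----
-echo $USER      # your Anvil username, e.g. x-yourusername
-echo $SCRATCH   # the absolute path of your scratch directory
-echo $HOME      # the absolute path of your home directory
-----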
-
-Your `$HOME` directory is your default directory. You can navigate to your `$HOME` directory using any of the following commands.
-
-[source,bash]
-----
-cd
-cd ~
-cd $HOME
-cd /home/$USER
-----
-
-This is typically where you will work, and where you will store your work (for instance, your completed projects).
-
-The `/anvil/projects/tdm` space is a directory created for The Data Mine. It holds our datasets (in the `data` directory), as well as data for many of our corporate partners projects.
-
-There exists 1 more important location on each cluster, `scratch`. Your `scratch` directory is located at `/anvil/scratch/$USER`, or, even shorter, `$SCRATCH`. `scratch` is meant for use with _really_ large chunks of data. The quota on Anvil is currently 100TB and 1 million files. You can see your quota and usage on Anvil by running the following command.
-
-[source,bash]
-----
-myquota
-----
-
-[NOTE]
-====
-Doug Crabill is one of the Data Mine's extraordinarily wise computer wizards, and he has kindly collated a variety of useful scripts to be publicly available to students. These can be found in `/anvil/projects/tdm/bin`. Feel free to explore this directory and learn about these scripts in your free time.
-====
-
-One of the helpful scripts we have at our disposal is `tokei`, a code analysis tool. We can use this tool to quickly determine the language makeup of a project. An in-depth explanation of tokei can be found https://github.com/XAMPPRocky/tokei[here], but for now, you can use it like so:
-
-[source,bash]
-----
-tokei /path/to/project
-----
-
-Sometimes, you may want to know what the first or last few lines of your file look like. https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/head[head] and https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/tail[tail] can help us do that. Take a look at their documentation to learn more.
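-
-As a quick illustration (using one of the flight files from the tip below as an assumed example), both commands accept a `-n` option for the number of lines to show:
-
-[source,bash]
-----
-# first 3 lines of the file
-head -n 3 /anvil/projects/tdm/data/flights/1987.csv
-
-# last 3 lines of the file
-tail -n 3 /anvil/projects/tdm/data/flights/1987.csv
-----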
-
-One goal of our programs is often to be size-efficient. If we have a very simple program, but it is enormous, it may not be worth our time to download and use. The https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/wc[wc] tool can help us determine the size of our file. Take a look at its documentation for more information.
-
-[CAUTION]
-====
-Be careful. We want the size of the script, not the disk usage.
-====
-
-Finally, we often may know that a program exists, but we don't know where it is. https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/which[which] can help us find the location of a program. Take a look at its documentation for more information, and use it to solve the last part of this question.
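-
-For example, to find where the `bash` shell itself lives (the output is shown as a comment and may differ on Anvil):
-
-[source,bash]
-----
-which bash
-# typically prints /usr/bin/bash or /bin/bash
-----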
-
-[TIP]
-====
-Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages.
-
-[source,bash]
-----
-man wc
-----
-
-You can see -m, -l, and -w are all options for `wc`. Then, to test the options out, you can try the following examples.
-
-[source,bash]
-----
-# using the default wc command. "/anvil/projects/tdm/data/flights/1987.csv" is the first "argument" given to the command.
-wc /anvil/projects/tdm/data/flights/1987.csv
-
-# to count the lines, use the -l option
-wc -l /anvil/projects/tdm/data/flights/1987.csv
-
-# to count the words, use the -w option
-wc -w /anvil/projects/tdm/data/flights/1987.csv
-
-# you can combine options as well
-wc -w -l /anvil/projects/tdm/data/flights/1987.csv
-
-# some people like to use a single "tack" `-`
-wc -wl /anvil/projects/tdm/data/flights/1987.csv
-
-# order doesn't matter
-wc -lw /anvil/projects/tdm/data/flights/1987.csv
-----
-====
-
-**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/head[head], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/tail[tail], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/wc[wc], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/which[which], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/type[type]
-
-.Items to submit
-====
-- Navigate to your scratch directory, and run tokei on your home directory, using only environment variables.
-- Print out the first 5 lines and last 5 lines of the `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` file.
-- Print out the number of lines in the `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` file.
-- Print out the size in bytes of the `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` file.
-- Print out the location of the `tokei` program we used earlier in this question.
-====
-
-=== Question 6 (2 pts)
-[upperalpha]
-.. Navigate to your scratch directory.
-.. Copy the file `/anvil/projects/tdm/data/movies_and_tv/imdb.db` to your current working directory.
-.. Create a new directory called `movies_and_tv` in your current working directory.
-.. Move the file, `imdb.db`, from your scratch directory to the newly created `movies_and_tv` directory (inside of scratch).
-.. Use `touch` to create a new, empty file called `im_empty.txt` in your scratch directory.
-.. Remove the directory, `movies_and_tv`, from your scratch directory, including _all_ of the contents.
-.. Remove the file, `im_empty.txt`, from your scratch directory.
-
-Now that we know how to navigate a UNIX-like system, let's learn how to create, move, and delete files and folders. For this question, perform the following operations in order. Each operation should be a single command. In your final submission, please ensure that all of your steps and their outputs are included.
-
-First, let's review the `cp` command. `cp` is short for copy, and it is used to copy files and folders. The syntax is as follows:
-
-[source,bash]
-----
-cp