diff --git a/tools-appendix/modules/python/images/matplot-histogram-aa.png b/tools-appendix/modules/python/images/matplot-histogram-aa.png new file mode 100644 index 000000000..46962f7ce Binary files /dev/null and b/tools-appendix/modules/python/images/matplot-histogram-aa.png differ diff --git a/tools-appendix/modules/python/images/matplot-scatterplot-aa.png b/tools-appendix/modules/python/images/matplot-scatterplot-aa.png new file mode 100644 index 000000000..78598b0fc Binary files /dev/null and b/tools-appendix/modules/python/images/matplot-scatterplot-aa.png differ diff --git a/tools-appendix/modules/python/pages/filtering-and-selecting.adoc b/tools-appendix/modules/python/pages/filtering-and-selecting.adoc index 39c81252e..1c8aecbb1 100644 --- a/tools-appendix/modules/python/pages/filtering-and-selecting.adoc +++ b/tools-appendix/modules/python/pages/filtering-and-selecting.adoc @@ -246,20 +246,21 @@ myDF[['ResidentStatus', 'Age']] The output of selecting multiple columns using the double brackets is a pandas `DataFrame`: ---- -ResidentStatus Age -0 1 87 -1 1 58 -2 1 75 -3 1 74 -4 1 64 -... ... ... -2631166 3 84 -2631167 3 74 -2631168 3 7 -2631169 4 49 -2631170 3 39 - -2631171 rows × 2 columns + ResidentStatus Age +0 1 87 +1 1 58 +2 1 75 +3 1 74 +4 1 64 +... ... ... +2631166 3 84 +2631167 3 74 +2631168 3 7 +2631169 4 49 +2631170 3 39 + +[2631171 rows x 2 columns] + ---- == The iloc function @@ -295,9 +296,10 @@ AgeType 1 Age 87 AgeSubstitutionFlag 0 Name: 0, dtype: object + ---- -We can also use `iloc[]` to select the first row (index 0) and all columns using (:): +We can also use `iloc[]` to select the first row (index 0) and all columns using `(:)` : [source,python] ---- myDF.iloc[0, :] @@ -379,16 +381,17 @@ filtered_myDF ---- - Id ResidentStatus Sex Age Race MaritalStatus -0 1 1 M 87 1 M -1 2 1 M 58 1 D -2 3 1 F 75 1 W -3 4 1 M 74 1 D -4 5 1 M 64 1 D -... ... ... ... ... ... ... 
+ Id ResidentStatus Sex Age Race MaritalStatus +0 1 1 M 87 1 M +1 2 1 M 58 1 D +2 3 1 F 75 1 W +3 4 1 M 74 1 D +4 5 1 M 64 1 D +... ... ... ... ... ... ... + ---- -Finally, let's try selecting multiple rows and multiple columns at the same time. When selecting multiple rows and multiple columns using iloc, the output is a subset of the DataFrame that contains the specified rows and all the columns. In this example, myDF.iloc[[0, 7, 9, 10], :] specifies the selection of rows 0, 7, 9, and 10 and all columns: +Finally, let's try selecting multiple rows and multiple columns at the same time. When selecting multiple rows and multiple columns using iloc, the output is a subset of the DataFrame that contains the specified rows and all the columns. In this example, `myDF.iloc[[0, 7, 9, 10], :]` specifies the selection of rows 0, 7, 9, and 10 and all columns: [source,python] ---- @@ -397,11 +400,12 @@ filtered_myDF.iloc[[0, 7, 9, 10], :] ---- - Id ResidentStatus Sex Age Race MaritalStatus -0 1 1 M 87 1 M -7 8 1 M 55 2 S -9 10 1 M 23 1 S -10 11 1 F 79 1 W + Id ResidentStatus Sex Age Race MaritalStatus +0 1 1 M 87 1 M +7 8 1 M 55 2 S +9 10 1 M 23 1 S +10 11 1 F 79 1 W + ---- == The loc function @@ -426,13 +430,14 @@ filtered_myDF.loc[:, filtered_myDF.columns != 'Race'] ---- ---- - Id ResidentStatus Sex Age MaritalStatus -0 1 1 M 87 M -1 2 1 M 58 D -2 3 1 F 75 W -3 4 1 M 74 D -4 5 1 M 64 D -... ... ... ... ... ... + Id ResidentStatus Sex Age MaritalStatus +0 1 1 M 87 M +1 2 1 M 58 D +2 3 1 F 75 W +3 4 1 M 74 D +4 5 1 M 64 D +... ... ... ... ... ... + ---- @@ -493,15 +498,15 @@ filtered_myDF[filtered_myDF['Sex'] == "F"] ---- ---- - Id ResidentStatus Sex Age Race MaritalStatus -2 3 1 F 75 1 W -5 6 1 F 93 1 W -8 9 1 F 86 1 W -10 11 1 F 79 1 W -12 13 1 F 85 1 W - ... ... ... ... ... ... + Id ResidentStatus Sex Age Race MaritalStatus +2 3 1 F 75 1 W +5 6 1 F 93 1 W +8 9 1 F 86 1 W +10 11 1 F 79 1 W +12 13 1 F 85 1 W +... ... ... ... ... ... ... 
+[1299710 rows × 6 columns] -1299710 rows × 6 columns ---- We can also use `.loc` for filtering for females. @@ -512,13 +517,15 @@ filtered_myDF.loc[filtered_myDF['Sex'] == "F"] ---- ---- - Id ResidentStatus Sex Age Race MaritalStatus -2 3 1 F 75 1 W -5 6 1 F 93 1 W -8 9 1 F 86 1 W -10 11 1 F 79 1 W -12 13 1 F 85 1 W - ... ... ... ... ... ... + Id ResidentStatus Sex Age Race MaritalStatus +2 3 1 F 75 1 W +5 6 1 F 93 1 W +8 9 1 F 86 1 W +10 11 1 F 79 1 W +12 13 1 F 85 1 W +... ... ... ... ... ... ... +[1299710 rows × 6 columns] + ---- Now let's filter for two things. Let's filter for Females who are 114 years old. Suprisingly, some people do live that long based on our dataset! @@ -529,12 +536,13 @@ filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)] ---- ---- - Id ResidentStatus Sex Age Race MaritalStatus -265482 265483 1 F 114 1 W -1304830 1304831 1 F 114 1 W -1372655 1372656 1 F 114 2 W -1981235 1981236 1 F 114 2 W -2407245 2407246 1 F 114 4 M + Id ResidentStatus Sex Age Race MaritalStatus +265482 265483 1 F 114 1 W +1304830 1304831 1 F 114 1 W +1372655 1372656 1 F 114 2 W +1981235 1981236 1 F 114 2 W +2407245 2407246 1 F 114 4 M + ---- Another method that would get us the same results: @@ -546,12 +554,13 @@ filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)] ---- ---- - Id ResidentStatus Sex Age Race MaritalStatus -265482 265483 1 F 114 1 W -1304830 1304831 1 F 114 1 W -1372655 1372656 1 F 114 2 W -1981235 1981236 1 F 114 2 W -2407245 2407246 1 F 114 4 M + Id ResidentStatus Sex Age Race MaritalStatus +265482 265483 1 F 114 1 W +1304830 1304831 1 F 114 1 W +1372655 1372656 1 F 114 2 W +1981235 1981236 1 F 114 2 W +2407245 2407246 1 F 114 4 M + ---- === Filtering and Modifying the Dataset diff --git a/tools-appendix/modules/python/pages/index.adoc b/tools-appendix/modules/python/pages/index.adoc index 46d2e7791..ffc62716a 100644 --- a/tools-appendix/modules/python/pages/index.adoc +++ 
b/tools-appendix/modules/python/pages/index.adoc @@ -18,6 +18,7 @@ Python is largely known for its readability and versatility. Its design philosop * xref:plotly-examples.adoc[Data Visualization with plotly] * xref:writing-functions.adoc[Writing Functions in Python] * xref:writing-scripts.adoc[Writing Scripts in Python] +* xref:pandas-series.adoc[Pandas Series] * xref:pandas-dates-and-times.adoc[Handling Dates and Times in pandas] * xref:pandas-aggregate-functions.adoc[Applying Aggregate Functions in pandas] * xref:pandas-reshaping.adoc[Reshaping Data in pandas] diff --git a/tools-appendix/modules/python/pages/matplotlib.adoc b/tools-appendix/modules/python/pages/matplotlib.adoc index 6cce8210c..41c12dcc7 100644 --- a/tools-appendix/modules/python/pages/matplotlib.adoc +++ b/tools-appendix/modules/python/pages/matplotlib.adoc @@ -1,11 +1,13 @@ -= matplotlib += Matplotlib When starting with Python, the most common plotting package is often `matplotlib`. It is an easy and straightforward plotting tool, with a surprising amount of depth. Like any package, it also has pluses and minuses. Importing `matplotlib` for use in a project is pretty straightforward: -* <> -* <> +* <> +* <> +* <> +* <> [source,python] ---- @@ -26,7 +28,7 @@ For those of us who aren't familiar with MATLAB the `pyplot` functionality creat {sp}+ -== barplot +== Barplots Using Matplotlib Barplots can take many forms. They are most often utilized when comparing change over time or comparisons between categories for a data set. As with many of the plotting types `matplotlib` has the built-in `barplot` function to create the visualizations. @@ -289,7 +291,7 @@ plt.close() This just starts to scratch the surface of what is possible with `matplotlib` but it does show the deep customization that is possible via the package. -== boxplot +== Boxplots Using Matplotlib `boxplot` is a function that creates a https://en.wikipedia.org/wiki/Box_plot[boxplot]. 
While that may not be very surprising, it is surprising how helpful boxplots can be in summarizing your data. Boxplots show a number of different measures related to the data such as quartiles, upper and lower bounds, and potential outliers. They can also he helpful to identify general trends between groups or over time. However, it should be noted there may be better plots for specific use cases. @@ -466,3 +468,53 @@ plt.close() image::box_6.png[Boxplot with better color, width=792, height=500, loading=lazy, title="Boxplot with better color"] Now we have a good looking boxplot! Hopefully this demonstration showed how helpful boxplots can be when interpreting data. It also shows how `matplotlib` plots can be further customized, to fit the needs of the visualization! + +== Histograms Using Matplotlib + +A histogram is a way to visualize the distribution of numerical data. In Python, it groups data points into intervals (called bins) and uses bars to represent the frequency of data falling within each interval. The height of each bar shows how many data points are in that range. + +Let's visualize the precipitation data in our dataset by plotting a histogram with Matplotlib. + + +[source,python] +---- +myDF = pd.read_csv("/anvil/projects/tdm/data/precip/precip.csv") +plt.hist(myDF['precip'], bins=10, edgecolor='black') +plt.title('Histogram of Precipitation') +plt.xlabel('Precipitation (inches)') +plt.ylabel('Frequency') +plt.show() +---- + + +image::matplot-histogram-aa.png[Plotting a histogram, width=792, height=500, loading=lazy, title="Histogram in Matplotlib"] + + + +== Scatterplots Using Matplotlib + +A scatter plot is a way to visualize the relationship between two variables. In Python, it uses individual points plotted on a Cartesian plane, where the position of each point is determined by its values for the two variables. Scatter plots are useful for identifying patterns, trends, or correlations in the data. 
+ +Let's visualize the precipitation data in our dataset by plotting a scatter plot with Matplotlib. + +[source,python] +---- +import pandas as pd +import matplotlib.pyplot as plt + +myDF = pd.read_csv("/anvil/projects/tdm/data/precip/precip.csv") +plt.scatter(myDF['place'].iloc[:10], myDF['precip'].iloc[:10], color='blue') + +plt.title("Scatter Plot of Precipitation (Top 10 Places)") +plt.xlabel("Place") +plt.ylabel("Precipitation (inches)") + +plt.xticks(rotation=45) +plt.tight_layout() +plt.show() +---- + +When creating plots, it's important to try to understand the overall trends they reveal. From the plot, we observe that among the first 10 places, Mobile, Phoenix, and Little Rock have the highest precipitation levels. + +image::matplot-scatterplot-aa.png[Plotting a scatterplot, width=792, height=500, loading=lazy, title="Scatterplot in Matplotlib"] +
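
A note on the scatterplot example: `.iloc[:10]` selects the first 10 rows in file order, which is not necessarily the 10 places with the most precipitation. If a true top 10 is wanted, one option is `DataFrame.nlargest`. The sketch below is only illustrative: it assumes a small made-up DataFrame (hypothetical place names and values, not taken from `precip.csv`) that reuses the `place` and `precip` column names from the example.

```python
import pandas as pd

# Hypothetical stand-in for precip.csv; the names and values are invented
# for illustration and only the column names match the example above.
df = pd.DataFrame({
    "place": ["A", "B", "C", "D", "E"],
    "precip": [12.0, 40.5, 7.3, 55.1, 30.0],
})

# Positional slice: the first three rows, in whatever order the file has.
first_three = df.iloc[:3]

# Value-based selection: the three rows with the largest precip values,
# returned in descending order of precip.
top_three = df.nlargest(3, "precip")

print(first_three["place"].tolist())
print(top_three["place"].tolist())
```

With real data, passing `top_three["place"]` and `top_three["precip"]` to `plt.scatter` would make a "Top 10" title accurate, while keeping `.iloc[:10]` is fine when the plot is meant to show only the first rows.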