diff --git a/.nojekyll b/.nojekyll
index 924a3d8..d1830b7 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-29800017
\ No newline at end of file
+2064708a
\ No newline at end of file
diff --git a/index.html b/index.html
index a227130..17c27f7 100644
--- a/index.html
+++ b/index.html
@@ -249,7 +249,7 @@
Running the code
You can install the exact packages that the book uses with the env.yml file:
mamba env create -f env.yml
If you’re not using mamba/conda you can install the following package versions and it should work:
/tmp/ipykernel_28258/1317853993.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
diff --git a/method_chaining.html b/method_chaining.html
index d90d636..e713236 100644
--- a/method_chaining.html
+++ b/method_chaining.html
@@ -690,7 +690,7 @@
/tmp/ipykernel_28342/4223446110.py:17: FutureWarning:
The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
/tmp/ipykernel_28342/2848021590.py:12: FutureWarning:
The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
-CPU times: user 182 ms, sys: 26.7 ms, total: 209 ms
-Wall time: 52 ms
+CPU times: user 149 ms, sys: 20.8 ms, total: 170 ms
+Wall time: 42.8 ms
@@ -762,8 +762,8 @@
.rename(columns={"↓OVA": "OVA"})
)
-CPU times: user 7.14 s, sys: 401 ms, total: 7.54 s
-Wall time: 7.57 s
+CPU times: user 6.53 s, sys: 336 ms, total: 6.87 s
+Wall time: 6.86 s
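For reference, the "CPU times" / "Wall time" lines in these hunks are what IPython's %%time cell magic prints around the Polars and Pandas FIFA cleanup pipelines quoted from performance.html further down in search.json. A rough stdlib sketch for reproducing just the wall-clock comparison outside a notebook (run_polars_cleanup and run_pandas_cleanup are hypothetical wrappers around the chapter's two pipelines):

    import time

    def wall_time(fn) -> float:
        # Wall-clock seconds for one call, roughly what %%time reports as "Wall time".
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    # print(f"polars: {wall_time(run_polars_cleanup):.2f} s")
    # print(f"pandas: {wall_time(run_pandas_cleanup):.2f} s")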
@@ -1027,7 +1027,7 @@
)
).collect()
-4 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+3.91 s ± 53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On my machine the NumPy version tends to be 5-20% faster than the pure Polars version:
@@ -1042,7 +1042,7 @@
)
).collect()
-5.19 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+4.64 s ± 51.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This may not be a huge performance difference, but it at least means you don’t sacrifice speed when relying on NumPy. There are some gotchas though so watch out for those.
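The timings in these hunks come from the great-circle-distance example quoted in full in the performance.html entry of search.json below. As a minimal self-contained sketch of the pattern being measured — a NumPy ufunc applied directly to Polars expressions inside a lazy query (toy coordinates; column names follow the chapter):

    import numpy as np
    import polars as pl

    def gcd_np(lat1, lng1, lat2, lng2):
        # Great-circle distance in km; works on NumPy arrays and Polars expressions alike.
        ϕ1 = np.deg2rad(90 - lat1)
        ϕ2 = np.deg2rad(90 - lat2)
        θ1 = np.deg2rad(lng1)
        θ2 = np.deg2rad(lng2)
        cos = np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) + np.cos(ϕ1) * np.cos(ϕ2)
        return np.arccos(cos) * 6373

    pairs = pl.LazyFrame({
        "LATITUDE": [40.64, 33.94],
        "LONGITUDE": [-73.78, -118.41],
        "LATITUDE_right": [51.47, 35.55],
        "LONGITUDE_right": [-0.45, 139.78],
    })
    distances = pairs.select(
        gcd_np(
            pl.col("LATITUDE"), pl.col("LONGITUDE"),
            pl.col("LATITUDE_right"), pl.col("LONGITUDE_right"),
        ).alias("gcd_km")
    ).collect()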
@@ -1057,7 +1057,7 @@
collected["LONGITUDE_right"].to_numpy()
)
-6.23 s ± 79.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+5.22 s ± 187 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@@ -1092,7 +1092,7 @@
%timeit rand_df_pl.select(polars_transform())
-3.32 s ± 91.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+3.08 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@@ -1100,7 +1100,7 @@
%timeit pandas_transform(rand_df_pd)
-2.18 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+2.18 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-CPU times: user 25.8 s, sys: 3.92 s, total: 29.8 s
-Wall time: 17.2 s
+CPU times: user 25.5 s, sys: 3.99 s, total: 29.5 s
+Wall time: 17 s
@@ -620,7 +620,7 @@
-shape: (10, 2)
-OCCUPATION (cat)     TRANSACTION_AMT (f64)
-"CHAIRMAN CEO &…     1.0233e6
-"PAULSON AND CO…     1e6
-"CO-FOUNDING DI…     875000.0
-…                    …
-"CHIEF EXECUTIV…     500000.0
-"MOORE CAPITAL …     500000.0
+shape: (10, 2)
+OCCUPATION (cat)     TRANSACTION_AMT (f64)
+"CHAIRMAN CEO &…     1.0233e6
+"PAULSON AND CO…     1e6
+"CO-FOUNDING DI…     875000.0
+…                    …
+"MOORE CAPITAL …     500000.0
+"PERRY HOMES"        500000.0
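The (10, 2) table in this hunk looks like a top-10 aggregation of TRANSACTION_AMT by a categorical OCCUPATION column, and the only difference between the two builds is the ordering of the trailing rows tied at 500000.0. A hypothetical sketch of a query with that shape (the chapter's actual data source and aggregation may differ):

    import polars as pl

    fec = pl.DataFrame(
        {
            "OCCUPATION": ["CHAIRMAN CEO & FOUNDER", "PAULSON AND CO., INC.", "RETIRED"],
            "TRANSACTION_AMT": [1_023_300.0, 1_000_000.0, 200.0],
        },
        schema_overrides={"OCCUPATION": pl.Categorical},
    )
    top_occupations = (
        fec.group_by("OCCUPATION")
        .agg(pl.col("TRANSACTION_AMT").mean())
        .sort("TRANSACTION_AMT", descending=True)
        .head(10)
    )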
diff --git a/search.json b/search.json
index 4155a0c..9e2d452 100644
--- a/search.json
+++ b/search.json
@@ -32,7 +32,7 @@
"href": "index.html#running-the-code-yourself",
"title": "Modern Polars",
"section": "Running the code yourself",
- "text": "Running the code yourself\nYou can install the exact packages that the book uses with the env.yml file:\nmamba env create -f env.yml\nIf you’re not using mamba/conda you can install the following package versions and it should work:\npolars: 0.20.2\npyarrow: 10.0.1\npandas: 2.1.1\nnumpy: 1.23.5\nfsspec: 2022.11.0\nmatplotlib: 3.8.0\nseaborn: 0.13.0\nstatsmodels: 0.14.0\nfilprofiler: 2022.11.0\n\nData\nAll the data fetching code is included, but will eventually break as websites change or shut down. The smaller datasets have been checked in here for posterity."
+ "text": "Running the code yourself\nYou can install the exact packages that the book uses with the env.yml file:\nmamba env create -f env.yml\nIf you’re not using mamba/conda you can install the following package versions and it should work:\npolars: 0.20.16\npyarrow: 10.0.1\npandas: 2.1.1\nnumpy: 1.23.5\nfsspec: 2022.11.0\nmatplotlib: 3.8.0\nseaborn: 0.13.0\nstatsmodels: 0.14.0\nfilprofiler: 2022.11.0\n\nData\nAll the data fetching code is included, but will eventually break as websites change or shut down. The smaller datasets have been checked in here for posterity."
},
{
"objectID": "index.html#contributing",
@@ -53,7 +53,7 @@
"href": "indexing.html#read-the-data",
"title": "1 Indexing (Or Lack Thereof)",
"section": "1.2 Read the data",
- "text": "1.2 Read the data\n\n\n\n\n\n\nTip\n\n\n\nThe examples in this book use the lazy evaluation feature of Polars less than you should. It’s just inconvenient to use the lazy API when displaying dozens of intermediate results for educational purposes.\n\n\n\nPolarsPandas\n\n\n\nimport polars as pl\npl.Config.set_tbl_rows(5) # don't print too many rows in the book\ndf_pl = pl.read_csv(extracted, truncate_ragged_lines=True)\ndf_pl\n\n\n\nshape: (537_902, 110)YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOut…Div1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNumi64i64i64i64i64strstri64strstri64i64i64i64strstrstri64stri64i64i64i64strstrstri64stri64i64strf64f64f64i64strf64…strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr202211145\"2022-01-14\"\"YX\"20452\"YX\"\"N119HQ\"487911066110660631066\"CMH\"\"Columbus, OH\"\"OH\"39\"Ohio\"4411278112780530852\"DCA\"\"Washington, DC…\"VA\"51\"Virginia\"381224\"1221\"-3.00.00.0-1\"1200-1259\"28.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null202211156\"2022-01-15\"\"YX\"20452\"YX\"\"N122HQ\"487911066110660631066\"CMH\"\"Columbus, OH\"\"OH\"39\"Ohio\"4411278112780530852\"DCA\"\"Washington, DC…\"VA\"51\"Virginia\"381224\"1214\"-10.00.00.0-1\"1200-1259\"19.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null202211167\"2022-01-16\"\"YX\"20452\"YX\"\"N412YX\"487911066110660631066\"CMH\"\"Columbus, OH\"\"OH\"39\"Ohio\"4411278112780530852\"DCA\"\"Washington, DC…\"VA\"51\"Virginia\"381224\"1218\"-6.00.00.0-1\"1200-1259\"16.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null………………………………………………………………………………………………………………………………………………………………………………………………………20221164\"2022-01-06\"\"DL\"19790\"DL\"\"N989AT\"157911057110570331057\"CLT\"\"Charlotte, NC\"\"NC\"37\"North Carolina…3610397103970730397\"ATL\"\"Atlanta, GA\"\"GA\"13\"Georgia\"341258\"1257\"-1.00.00.0-1\"1200-1259\"15.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null20221164\"2022-01-06\"\"DL\"19790\"DL\"\"N815DN\"158014869148690334614\"SLC\"\"Salt Lake City…\"UT\"49\"Utah\"8714057140570234057\"PDX\"\"Portland, 
OR\"\"OR\"41\"Oregon\"922240\"2231\"-9.00.00.0-1\"2200-2259\"10.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null\n\n\n\n\n\nimport pandas as pd\npd.options.display.max_rows = 5\ndf_pd = pd.read_csv(extracted)\ndf_pd\n\n/tmp/ipykernel_13811/2805799744.py:3: DtypeWarning:\n\nColumns (76,77,84) have mixed types. Specify dtype option on import or set low_memory=False.\n\n\n\n\n\n\n\n \n \n \n Year\n Quarter\n Month\n DayofMonth\n DayOfWeek\n FlightDate\n Reporting_Airline\n DOT_ID_Reporting_Airline\n IATA_CODE_Reporting_Airline\n Tail_Number\n ...\n Div4TailNum\n Div5Airport\n Div5AirportID\n Div5AirportSeqID\n Div5WheelsOn\n Div5TotalGTime\n Div5LongestGTime\n Div5WheelsOff\n Div5TailNum\n Unnamed: 109\n \n \n \n \n 0\n 2022\n 1\n 1\n 14\n 5\n 2022-01-14\n YX\n 20452\n YX\n N119HQ\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n 1\n 2022\n 1\n 1\n 15\n 6\n 2022-01-15\n YX\n 20452\n YX\n N122HQ\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 537900\n 2022\n 1\n 1\n 6\n 4\n 2022-01-06\n DL\n 19790\n DL\n N989AT\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n 537901\n 2022\n 1\n 1\n 6\n 4\n 2022-01-06\n DL\n 19790\n DL\n N815DN\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n\n537902 rows × 110 columns"
+ "text": "1.2 Read the data\n\n\n\n\n\n\nTip\n\n\n\nThe examples in this book use the lazy evaluation feature of Polars less than you should. It’s just inconvenient to use the lazy API when displaying dozens of intermediate results for educational purposes.\n\n\n\nPolarsPandas\n\n\n\nimport polars as pl\npl.Config.set_tbl_rows(5) # don't print too many rows in the book\ndf_pl = pl.read_csv(extracted, truncate_ragged_lines=True)\ndf_pl\n\n\n\nshape: (537_902, 110)YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOut…Div1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNumi64i64i64i64i64strstri64strstri64i64i64i64strstrstri64stri64i64i64i64strstrstri64stri64i64strf64f64f64i64strf64…strstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstrstr202211145\"2022-01-14\"\"YX\"20452\"YX\"\"N119HQ\"487911066110660631066\"CMH\"\"Columbus, OH\"\"OH\"39\"Ohio\"4411278112780530852\"DCA\"\"Washington, DC…\"VA\"51\"Virginia\"381224\"1221\"-3.00.00.0-1\"1200-1259\"28.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null202211156\"2022-01-15\"\"YX\"20452\"YX\"\"N122HQ\"487911066110660631066\"CMH\"\"Columbus, OH\"\"OH\"39\"Ohio\"4411278112780530852\"DCA\"\"Washington, DC…\"VA\"51\"Virginia\"381224\"1214\"-10.00.00.0-1\"1200-1259\"19.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null202211167\"2022-01-16\"\"YX\"20452\"YX\"\"N412YX\"487911066110660631066\"CMH\"\"Columbus, OH\"\"OH\"39\"Ohio\"4411278112780530852\"DCA\"\"Washington, DC…\"VA\"51\"Virginia\"381224\"1218\"-6.00.00.0-1\"1200-1259\"16.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null………………………………………………………………………………………………………………………………………………………………………………………………………20221164\"2022-01-06\"\"DL\"19790\"DL\"\"N989AT\"157911057110570331057\"CLT\"\"Charlotte, NC\"\"NC\"37\"North Carolina…3610397103970730397\"ATL\"\"Atlanta, GA\"\"GA\"13\"Georgia\"341258\"1257\"-1.00.00.0-1\"1200-1259\"15.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null20221164\"2022-01-06\"\"DL\"19790\"DL\"\"N815DN\"158014869148690334614\"SLC\"\"Salt Lake City…\"UT\"49\"Utah\"8714057140570234057\"PDX\"\"Portland, 
OR\"\"OR\"41\"Oregon\"922240\"2231\"-9.00.00.0-1\"2200-2259\"10.0…nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"\"\"nullnull\"\"nullnull\"\"\"\"null\n\n\n\n\n\nimport pandas as pd\npd.options.display.max_rows = 5\ndf_pd = pd.read_csv(extracted)\ndf_pd\n\n/tmp/ipykernel_28258/2805799744.py:3: DtypeWarning:\n\nColumns (76,77,84) have mixed types. Specify dtype option on import or set low_memory=False.\n\n\n\n\n\n\n\n \n \n \n Year\n Quarter\n Month\n DayofMonth\n DayOfWeek\n FlightDate\n Reporting_Airline\n DOT_ID_Reporting_Airline\n IATA_CODE_Reporting_Airline\n Tail_Number\n ...\n Div4TailNum\n Div5Airport\n Div5AirportID\n Div5AirportSeqID\n Div5WheelsOn\n Div5TotalGTime\n Div5LongestGTime\n Div5WheelsOff\n Div5TailNum\n Unnamed: 109\n \n \n \n \n 0\n 2022\n 1\n 1\n 14\n 5\n 2022-01-14\n YX\n 20452\n YX\n N119HQ\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n 1\n 2022\n 1\n 1\n 15\n 6\n 2022-01-15\n YX\n 20452\n YX\n N122HQ\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 537900\n 2022\n 1\n 1\n 6\n 4\n 2022-01-06\n DL\n 19790\n DL\n N989AT\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n 537901\n 2022\n 1\n 1\n 6\n 4\n 2022-01-06\n DL\n 19790\n DL\n N815DN\n ...\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n NaN\n \n \n\n537902 rows × 110 columns"
},
{
"objectID": "indexing.html#indexing",
@@ -74,7 +74,7 @@
"href": "indexing.html#settingwithcopy",
"title": "1 Indexing (Or Lack Thereof)",
"section": "1.5 SettingWithCopy",
- "text": "1.5 SettingWithCopy\nPandas has this cute thing where if you assign values to some subset of the dataframe with square bracket indexing, it doesn’t work and gives the notorious SettingWithCopyWarning. To be fair, this warning also tells you to assign using .loc. Unfortunately many people in the Pandas community can’t read and instead just ignore the warning.\nPolars is not yet popular enough to attact the same crowd, but when it does it should not run into the same problem, as the only way to add or overwrite columns in Polars is the with_columns method.\n\nPolarsPandas (bad)Pandas (good)Pandas (better)\n\n\n\nf = pl.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf.with_columns(\n pl.when(pl.col(\"a\") <= 3)\n .then(pl.col(\"b\") // 10)\n .otherwise(pl.col(\"b\"))\n)\n\n\n\nshape: (5, 2)abi64i64112233440550\n\n\n\n\n\nf = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf[f['a'] <= 3]['b'] = f['b'] // 10\nf\n\n/tmp/ipykernel_13811/1317853993.py:2: SettingWithCopyWarning:\n\n\nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n\n\n\n\n\n\n\n \n \n \n a\n b\n \n \n \n \n 0\n 1\n 10\n \n \n 1\n 2\n 20\n \n \n 2\n 3\n 30\n \n \n 3\n 4\n 40\n \n \n 4\n 5\n 50\n \n \n\n\n\n\n\n\n\nf = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf.loc[f['a'] <= 3, \"b\"] = f['b'] // 10\nf\n\n\n\n\n\n \n \n \n a\n b\n \n \n \n \n 0\n 1\n 1\n \n \n 1\n 2\n 2\n \n \n 2\n 3\n 3\n \n \n 3\n 4\n 40\n \n \n 4\n 5\n 50\n \n \n\n\n\n\n\n\n\nf = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf.assign(b=f[\"b\"].mask(f[\"a\"] <=3, f[\"b\"] // 10))\n\n\n\n\n\n \n \n \n a\n b\n \n \n \n \n 0\n 1\n 1\n \n \n 1\n 2\n 2\n \n \n 2\n 3\n 3\n \n \n 3\n 4\n 40\n \n \n 4\n 5\n 50"
+ "text": "1.5 SettingWithCopy\nPandas has this cute thing where if you assign values to some subset of the dataframe with square bracket indexing, it doesn’t work and gives the notorious SettingWithCopyWarning. To be fair, this warning also tells you to assign using .loc. Unfortunately many people in the Pandas community can’t read and instead just ignore the warning.\nPolars is not yet popular enough to attact the same crowd, but when it does it should not run into the same problem, as the only way to add or overwrite columns in Polars is the with_columns method.\n\nPolarsPandas (bad)Pandas (good)Pandas (better)\n\n\n\nf = pl.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf.with_columns(\n pl.when(pl.col(\"a\") <= 3)\n .then(pl.col(\"b\") // 10)\n .otherwise(pl.col(\"b\"))\n)\n\n\n\nshape: (5, 2)abi64i64112233440550\n\n\n\n\n\nf = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf[f['a'] <= 3]['b'] = f['b'] // 10\nf\n\n/tmp/ipykernel_28258/1317853993.py:2: SettingWithCopyWarning:\n\n\nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n\n\n\n\n\n\n\n \n \n \n a\n b\n \n \n \n \n 0\n 1\n 10\n \n \n 1\n 2\n 20\n \n \n 2\n 3\n 30\n \n \n 3\n 4\n 40\n \n \n 4\n 5\n 50\n \n \n\n\n\n\n\n\n\nf = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf.loc[f['a'] <= 3, \"b\"] = f['b'] // 10\nf\n\n\n\n\n\n \n \n \n a\n b\n \n \n \n \n 0\n 1\n 1\n \n \n 1\n 2\n 2\n \n \n 2\n 3\n 3\n \n \n 3\n 4\n 40\n \n \n 4\n 5\n 50\n \n \n\n\n\n\n\n\n\nf = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})\nf.assign(b=f[\"b\"].mask(f[\"a\"] <=3, f[\"b\"] // 10))\n\n\n\n\n\n \n \n \n a\n b\n \n \n \n \n 0\n 1\n 1\n \n \n 1\n 2\n 2\n \n \n 2\n 3\n 3\n \n \n 3\n 4\n 40\n \n \n 4\n 5\n 50"
},
{
"objectID": "indexing.html#summary",
@@ -109,7 +109,7 @@
"href": "method_chaining.html#example-plots",
"title": "2 Method Chaining",
"section": "2.4 Example plots",
- "text": "2.4 Example plots\n\n2.4.1 Daily Flights\nHere’s how plotting the number of daily flights looks in Polars and Pandas:\n\n\n\n\n\n\nNote\n\n\n\nPolars has its own built-in plotting with hvPlot but the time series plots are not great. Since most of the plots in this book are time series plots, we’ll just use .to_pandas() followed by .plot() function after doing all the data manipulation in Polars.\n\n\n\nPolarsPandas\n\n\n\n# filter for the busiest airlines\nfilter_expr = pl.col(\"IATA_CODE_Reporting_Airline\").is_in(\n pl.col(\"IATA_CODE_Reporting_Airline\")\n .value_counts(sort=True)\n .struct.field(\"IATA_CODE_Reporting_Airline\")\n .head(5)\n)\n(\n df_pl\n .drop_nulls(subset=[\"DepTime\", \"IATA_CODE_Reporting_Airline\"])\n .filter(filter_expr)\n .sort(\"DepTime\")\n .group_by_dynamic(\n \"DepTime\",\n every=\"1h\",\n by=\"IATA_CODE_Reporting_Airline\")\n .agg(pl.col(\"Flight_Number_Reporting_Airline\").count())\n .pivot(\n index=\"DepTime\",\n columns=\"IATA_CODE_Reporting_Airline\",\n values=\"Flight_Number_Reporting_Airline\",\n )\n .sort(\"DepTime\")\n # fill every missing hour with 0 so the plot looks better\n .upsample(time_column=\"DepTime\", every=\"1h\")\n .fill_null(0)\n .select([pl.col(\"DepTime\"), pl.col(pl.UInt32).rolling_sum(24)])\n .to_pandas()\n .set_index(\"DepTime\")\n .rename_axis(\"Flights per Day\", axis=1)\n .plot()\n)\n\n\n\n\n\n\n\n\n\n\n(\n df_pd\n .dropna(subset=[\"DepTime\", \"IATA_CODE_Reporting_Airline\"])\n # filter for the busiest airlines\n .loc[\n lambda x: x[\"IATA_CODE_Reporting_Airline\"].isin(\n x[\"IATA_CODE_Reporting_Airline\"].value_counts().index[:5]\n )\n ]\n .assign(\n IATA_CODE_Reporting_Airline=lambda x: x[\n \"IATA_CODE_Reporting_Airline\"\n ].cat.remove_unused_categories() # annoying pandas behaviour\n )\n .set_index(\"DepTime\")\n # TimeGrouper to resample & groupby at once\n .groupby([\"IATA_CODE_Reporting_Airline\", pd.Grouper(freq=\"H\")])[\n \"Flight_Number_Reporting_Airline\"\n ]\n .count()\n # the .pivot takes care of this in the Polars code.\n .unstack(0)\n .fillna(0)\n .rolling(24)\n .sum()\n .rename_axis(\"Flights per Day\", axis=1)\n .plot()\n)\n\n/tmp/ipykernel_13915/4223446110.py:17: FutureWarning:\n\nThe default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n\n\n\n\n\n\n\n\n\n\n\n\nDifferences between Polars and Pandas:\n\nTo group by a time window and another value, we use .groupby_dynamic. In Pandas we use .groupby with the pd.Grouper helper.\nInstead of .rolling(n).sum(), Polars has .rolling_sum(n).\nIf you see Pandas code using .unstack, the corresponding Polars code probably needs .pivot.\nIn Polars, .value_counts returns a pl.Struct column containing the value and the value count. In Pandas it returns a series where the elements are the value counts and the index contains the values themselves.\nIn Polars we need to select all the UInt32 cols at one point using pl.col(pl.UInt32). 
In Pandas, the way .rolling works means we don’t need to select these cols explicitly, but if we did it would look like df.select_dtypes(\"uint32\").\n\n\n\n2.4.2 Planes With Multiple Daily Flights\nNow let’s see if planes with multiple flights per day tend to get delayed as the day goes on:\n\nPolarsPandas\n\n\n\nflights_pl = (\n df_pl.select(\n pl.col([\n \"FlightDate\",\n \"Tail_Number\",\n \"DepTime\",\n \"DepDelay\"\n ])\n )\n .drop_nulls()\n .sort(\"DepTime\")\n .filter(pl.col(\"DepDelay\") < 500)\n .with_columns(\n pl.col(\"DepTime\")\n .rank()\n .over([\"FlightDate\", \"Tail_Number\"])\n .alias(\"turn\")\n )\n)\n\nfig, ax = plt.subplots(figsize=(10, 5))\nsns.boxplot(x=\"turn\", y=\"DepDelay\", data=flights_pl, ax=ax)\nax.set_ylim(-50, 50)\n\n(-50.0, 50.0)\n\n\n\n\n\n\n\n\nflights_pd = (\n df_pd[[\n \"FlightDate\",\n \"Tail_Number\",\n \"DepTime\",\n \"DepDelay\"\n ]]\n .dropna()\n .sort_values('DepTime')\n .loc[lambda x: x[\"DepDelay\"] < 500]\n .assign(turn = lambda x:\n x.groupby([\"FlightDate\", \"Tail_Number\"])\n [\"DepTime\"].transform('rank')\n .astype(int)\n )\n)\nfig, ax = plt.subplots(figsize=(10, 5))\nsns.boxplot(x=\"turn\", y=\"DepDelay\", data=flights_pd, ax=ax)\nax.set_ylim(-50, 50)\n\n/tmp/ipykernel_13915/2848021590.py:12: FutureWarning:\n\nThe default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n\n\n\n(-50.0, 50.0)\n\n\n\n\n\n\n\n\nOne new thing here: window functions. When Pandas code looks like:\n.groupby(\"country\")[\"population\"].transform(\"sum\")\nthe equivalent Polars code will look like:\npl.col(\"population\").sum().over(\"country\")\n\n\n2.4.3 Delay by hour of the day\nMaybe later flights have longer delays:\n\nPolarsPandas\n\n\n\nplt.figure(figsize=(10, 5))\n(\n df_pl.select(\n pl.col(\n [\"FlightDate\", \"Tail_Number\", \"DepTime\", \"DepDelay\"],\n )\n )\n .drop_nulls()\n .filter(pl.col(\"DepDelay\").is_between(5, 600, closed=\"none\"))\n .with_columns(pl.col(\"DepTime\").dt.hour().alias(\"hour\"))\n .to_pandas()\n .pipe((sns.boxplot, \"data\"), x=\"hour\", y=\"DepDelay\")\n)\n\n\n\n\n\n\n\n\n\n\nplt.figure(figsize=(10, 5))\n(\n df_pd[[\"FlightDate\", \"Tail_Number\", \"DepTime\", \"DepDelay\"]]\n .dropna()\n .loc[lambda df: df[\"DepDelay\"].between(5, 600, inclusive=\"neither\")]\n .assign(hour=lambda df: df[\"DepTime\"].dt.hour)\n .pipe((sns.boxplot, \"data\"), x=\"hour\", y=\"DepDelay\")\n)\n\n"
+ "text": "2.4 Example plots\n\n2.4.1 Daily Flights\nHere’s how plotting the number of daily flights looks in Polars and Pandas:\n\n\n\n\n\n\nNote\n\n\n\nPolars has its own built-in plotting with hvPlot but the time series plots are not great. Since most of the plots in this book are time series plots, we’ll just use .to_pandas() followed by .plot() function after doing all the data manipulation in Polars.\n\n\n\nPolarsPandas\n\n\n\n# filter for the busiest airlines\nfilter_expr = pl.col(\"IATA_CODE_Reporting_Airline\").is_in(\n pl.col(\"IATA_CODE_Reporting_Airline\")\n .value_counts(sort=True)\n .struct.field(\"IATA_CODE_Reporting_Airline\")\n .head(5)\n)\n(\n df_pl\n .drop_nulls(subset=[\"DepTime\", \"IATA_CODE_Reporting_Airline\"])\n .filter(filter_expr)\n .sort(\"DepTime\")\n .group_by_dynamic(\n \"DepTime\",\n every=\"1h\",\n by=\"IATA_CODE_Reporting_Airline\")\n .agg(pl.col(\"Flight_Number_Reporting_Airline\").count())\n .pivot(\n index=\"DepTime\",\n columns=\"IATA_CODE_Reporting_Airline\",\n values=\"Flight_Number_Reporting_Airline\",\n )\n .sort(\"DepTime\")\n # fill every missing hour with 0 so the plot looks better\n .upsample(time_column=\"DepTime\", every=\"1h\")\n .fill_null(0)\n .select([pl.col(\"DepTime\"), pl.col(pl.UInt32).rolling_sum(24)])\n .to_pandas()\n .set_index(\"DepTime\")\n .rename_axis(\"Flights per Day\", axis=1)\n .plot()\n)\n\n\n\n\n\n\n\n\n\n\n(\n df_pd\n .dropna(subset=[\"DepTime\", \"IATA_CODE_Reporting_Airline\"])\n # filter for the busiest airlines\n .loc[\n lambda x: x[\"IATA_CODE_Reporting_Airline\"].isin(\n x[\"IATA_CODE_Reporting_Airline\"].value_counts().index[:5]\n )\n ]\n .assign(\n IATA_CODE_Reporting_Airline=lambda x: x[\n \"IATA_CODE_Reporting_Airline\"\n ].cat.remove_unused_categories() # annoying pandas behaviour\n )\n .set_index(\"DepTime\")\n # TimeGrouper to resample & groupby at once\n .groupby([\"IATA_CODE_Reporting_Airline\", pd.Grouper(freq=\"H\")])[\n \"Flight_Number_Reporting_Airline\"\n ]\n .count()\n # the .pivot takes care of this in the Polars code.\n .unstack(0)\n .fillna(0)\n .rolling(24)\n .sum()\n .rename_axis(\"Flights per Day\", axis=1)\n .plot()\n)\n\n/tmp/ipykernel_28342/4223446110.py:17: FutureWarning:\n\nThe default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n\n\n\n\n\n\n\n\n\n\n\n\nDifferences between Polars and Pandas:\n\nTo group by a time window and another value, we use .groupby_dynamic. In Pandas we use .groupby with the pd.Grouper helper.\nInstead of .rolling(n).sum(), Polars has .rolling_sum(n).\nIf you see Pandas code using .unstack, the corresponding Polars code probably needs .pivot.\nIn Polars, .value_counts returns a pl.Struct column containing the value and the value count. In Pandas it returns a series where the elements are the value counts and the index contains the values themselves.\nIn Polars we need to select all the UInt32 cols at one point using pl.col(pl.UInt32). 
In Pandas, the way .rolling works means we don’t need to select these cols explicitly, but if we did it would look like df.select_dtypes(\"uint32\").\n\n\n\n2.4.2 Planes With Multiple Daily Flights\nNow let’s see if planes with multiple flights per day tend to get delayed as the day goes on:\n\nPolarsPandas\n\n\n\nflights_pl = (\n df_pl.select(\n pl.col([\n \"FlightDate\",\n \"Tail_Number\",\n \"DepTime\",\n \"DepDelay\"\n ])\n )\n .drop_nulls()\n .sort(\"DepTime\")\n .filter(pl.col(\"DepDelay\") < 500)\n .with_columns(\n pl.col(\"DepTime\")\n .rank()\n .over([\"FlightDate\", \"Tail_Number\"])\n .alias(\"turn\")\n )\n)\n\nfig, ax = plt.subplots(figsize=(10, 5))\nsns.boxplot(x=\"turn\", y=\"DepDelay\", data=flights_pl, ax=ax)\nax.set_ylim(-50, 50)\n\n(-50.0, 50.0)\n\n\n\n\n\n\n\n\nflights_pd = (\n df_pd[[\n \"FlightDate\",\n \"Tail_Number\",\n \"DepTime\",\n \"DepDelay\"\n ]]\n .dropna()\n .sort_values('DepTime')\n .loc[lambda x: x[\"DepDelay\"] < 500]\n .assign(turn = lambda x:\n x.groupby([\"FlightDate\", \"Tail_Number\"])\n [\"DepTime\"].transform('rank')\n .astype(int)\n )\n)\nfig, ax = plt.subplots(figsize=(10, 5))\nsns.boxplot(x=\"turn\", y=\"DepDelay\", data=flights_pd, ax=ax)\nax.set_ylim(-50, 50)\n\n/tmp/ipykernel_28342/2848021590.py:12: FutureWarning:\n\nThe default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n\n\n\n(-50.0, 50.0)\n\n\n\n\n\n\n\n\nOne new thing here: window functions. When Pandas code looks like:\n.groupby(\"country\")[\"population\"].transform(\"sum\")\nthe equivalent Polars code will look like:\npl.col(\"population\").sum().over(\"country\")\n\n\n2.4.3 Delay by hour of the day\nMaybe later flights have longer delays:\n\nPolarsPandas\n\n\n\nplt.figure(figsize=(10, 5))\n(\n df_pl.select(\n pl.col(\n [\"FlightDate\", \"Tail_Number\", \"DepTime\", \"DepDelay\"],\n )\n )\n .drop_nulls()\n .filter(pl.col(\"DepDelay\").is_between(5, 600, closed=\"none\"))\n .with_columns(pl.col(\"DepTime\").dt.hour().alias(\"hour\"))\n .to_pandas()\n .pipe((sns.boxplot, \"data\"), x=\"hour\", y=\"DepDelay\")\n)\n\n\n\n\n\n\n\n\n\n\nplt.figure(figsize=(10, 5))\n(\n df_pd[[\"FlightDate\", \"Tail_Number\", \"DepTime\", \"DepDelay\"]]\n .dropna()\n .loc[lambda df: df[\"DepDelay\"].between(5, 600, inclusive=\"neither\")]\n .assign(hour=lambda df: df[\"DepTime\"].dt.hour)\n .pipe((sns.boxplot, \"data\"), x=\"hour\", y=\"DepDelay\")\n)\n\n"
},
{
"objectID": "method_chaining.html#how-much-is-too-much",
@@ -137,21 +137,21 @@
"href": "performance.html#polars-is-faster-at-the-boring-stuff",
"title": "3 Performance",
"section": "3.2 Polars is faster at the boring stuff",
- "text": "3.2 Polars is faster at the boring stuff\nHere we’ll clean up a messy dataset, kindly provided by Kaggle user Rachit Toshniwal as a deliberate example of a really crap CSV. Most of the cleanup involves extracting numeric data from awkward strings.\nAlso, the data is too small so I’ve concatenated it to itself 20 times. We’re not doing anything that will care about the duplication. Here’s how the raw table looks:\n\n\nCode\nimport pandas as pd\npd.read_csv(\"../data/fifa21_raw_big.csv\", dtype=\"string\", nrows=2)\n\n\n\n\n\n\n \n \n \n ID\n Name\n LongName\n photoUrl\n playerUrl\n Nationality\n Age\n ↓OVA\n POT\n Club\n ...\n A/W\n D/W\n IR\n PAC\n SHO\n PAS\n DRI\n DEF\n PHY\n Hits\n \n \n \n \n 0\n 158023\n L. Messi\n Lionel Messi\n https://cdn.sofifa.com/players/158/023/21_60.png\n http://sofifa.com/player/158023/lionel-messi/2...\n Argentina\n 33\n 93\n 93\n FC Barcelona\n ...\n Medium\n Low\n 5 ★\n 85\n 92\n 91\n 95\n 38\n 65\n 771\n \n \n 1\n 20801\n Cristiano Ronaldo\n C. Ronaldo dos Santos Aveiro\n https://cdn.sofifa.com/players/020/801/21_60.png\n http://sofifa.com/player/20801/c-ronaldo-dos-s...\n Portugal\n 35\n 92\n 92\n Juventus\n ...\n High\n Low\n 5 ★\n 89\n 93\n 81\n 89\n 35\n 77\n 562\n \n \n\n2 rows × 77 columns\n\n\n\nFor this exercise we’ll assume we want to make use of all the columns. First some boilerplate where we map out the different data types:\n\n\nCode\nimport pandas as pd\nimport polars as pl\nimport numpy as np\nimport math\nstr_cols = [\n \"Name\",\n \"LongName\",\n \"playerUrl\",\n \"photoUrl\",\n]\ninitial_category_cols_pl = [\n \"Nationality\",\n \"Preferred Foot\",\n \"Best Position\",\n \"A/W\",\n \"D/W\"\n]\ncategory_cols = [*initial_category_cols_pl, \"Club\"]\ndate_cols = [\n \"Joined\",\n \"Loan Date End\"\n]\n# these all start with the euro symbol and end with 0, M or K\nmoney_cols = [\n \"Value\",\n \"Wage\",\n \"Release Clause\"\n]\nstar_cols = [\n \"W/F\",\n \"SM\",\n \"IR\",\n]\n# Contract col is a range of years\n# Positions is a list of positions\n# Height is in cm\n# Weight is in kg\n# Hits is numbers with K and M \nmessy_cols = [\n \"Contract\",\n \"Positions\",\n \"Height\",\n \"Weight\",\n \"Hits\"\n]\ninitially_str_cols = str_cols + date_cols + money_cols + star_cols + messy_cols\ninitially_str_cols_pl = [*initially_str_cols, \"Club\"]\nu32_cols = [\n \"ID\",\n \"Total Stats\"\n]\nu8_cols = [\n 'Age',\n '↓OVA',\n 'POT',\n 'BOV',\n 'Crossing',\n 'Finishing',\n 'Heading Accuracy',\n 'Short Passing',\n 'Volleys',\n 'Dribbling',\n 'Curve',\n 'FK Accuracy',\n 'Long Passing',\n 'Ball Control',\n 'Acceleration',\n 'Sprint Speed',\n 'Agility',\n 'Reactions',\n 'Balance',\n 'Shot Power',\n 'Jumping',\n 'Stamina',\n 'Strength',\n 'Long Shots',\n 'Aggression',\n 'Interceptions',\n 'Positioning',\n 'Vision',\n 'Penalties',\n 'Composure',\n 'Marking',\n 'Standing Tackle',\n 'Sliding Tackle',\n 'GK Diving',\n 'GK Handling',\n 'GK Kicking',\n 'GK Positioning',\n 'GK Reflexes',\n 'PAC',\n 'SHO',\n 'PAS',\n 'DRI',\n 'DEF',\n 'PHY'\n]\n\nu16_cols = [\n 'Attacking',\n 'Skill',\n 'Movement',\n 'Power',\n 'Mentality',\n 'Defending',\n 'Goalkeeping',\n 'Total Stats',\n 'Base Stats'\n]\n\n\n\n3.2.1 Dtypes\nHere are the initial dtypes for the two dataframes:\n\nPolarsPandas\n\n\n\n# can't use UInt8/16 in scan_csv\ndtypes_pl = (\n {col: pl.Utf8 for col in initially_str_cols_pl}\n | {col: pl.Categorical for col in initial_category_cols_pl}\n | {col: pl.UInt32 for col in [*u32_cols, *u16_cols, *u8_cols]}\n)\n\n\n\n\ndtypes_pd = (\n {col: 
pd.StringDtype() for col in initially_str_cols}\n | {col: pd.CategoricalDtype() for col in category_cols}\n | {col: \"uint32\" for col in u32_cols}\n | {col: \"uint8\" for col in u8_cols}\n | {col: \"uint16\" for col in u16_cols}\n)\n\n\n\n\nOne thing I’ll note here is that Pandas numeric types are somewhat confusing: \"uint32\" means np.uint32 which is not the same thing as pd.UInt32Dtype(). Only the latter is nullable. On the other hand, Polars has just one unsigned 32-bit integer type, and it’s nullable.\n\n\n\n\n\n\nTip\n\n\n\nPolars expressions have a shrink_dtype method that can be more convenient than manually specifying the dtypes yourself. It’s not magic though, and it has to spend time finding the min and max of the column.\n\n\n\n\n3.2.2 Data cleaning\nThere’s not much that you haven’t seen here already, so we won’t explain the code line by line. The main new thing here is pl.when for ternary expressions.\n\nPolarsPandas\n\n\n\ndef parse_date_pl(col: pl.Expr) -> pl.Expr:\n return col.str.strptime(pl.Date, format=\"%b %d, %Y\")\n\ndef parse_suffixed_num_pl(col: pl.Expr) -> pl.Expr:\n suffix = col.str.slice(-1, 1)\n suffix_value = (\n pl.when(suffix == \"K\")\n .then(1_000)\n .when(suffix == \"M\")\n .then(1_000_000)\n .otherwise(1)\n .cast(pl.UInt32)\n )\n without_suffix = (\n col\n .str.replace(\"K\", \"\", literal=True)\n .str.replace(\"M\", \"\", literal=True)\n .cast(pl.Float32)\n )\n original_name = col.meta.output_name()\n return (suffix_value * without_suffix).alias(original_name)\n\ndef parse_money_pl(col: pl.Expr) -> pl.Expr:\n return parse_suffixed_num_pl(col.str.slice(1)).cast(pl.UInt32)\n\ndef parse_star_pl(col: pl.Expr) -> pl.Expr:\n return col.str.slice(0, 1).cast(pl.UInt8)\n\ndef feet_to_cm_pl(col: pl.Expr) -> pl.Expr:\n feet_inches_split = col.str.split_exact(\"'\", 1)\n total_inches = (\n (feet_inches_split.struct.field(\"field_0\").cast(pl.UInt8, strict=False) * 12)\n + feet_inches_split.struct.field(\"field_1\").str.strip_chars_end('\"').cast(pl.UInt8, strict=False)\n )\n return (total_inches * 2.54).round(0).cast(pl.UInt8)\n\ndef parse_height_pl(col: pl.Expr) -> pl.Expr:\n is_cm = col.str.ends_with(\"cm\")\n return (\n pl.when(is_cm)\n .then(col.str.slice(0, 3).cast(pl.UInt8, strict=False))\n .otherwise(feet_to_cm_pl(col))\n )\n\ndef parse_weight_pl(col: pl.Expr) -> pl.Expr:\n is_kg = col.str.ends_with(\"kg\")\n without_unit = col.str.extract(r\"(\\d+)\").cast(pl.UInt8)\n return (\n pl.when(is_kg)\n .then(without_unit)\n .otherwise((without_unit * 0.453592).round(0).cast(pl.UInt8))\n )\n\ndef parse_contract_pl(col: pl.Expr) -> list[pl.Expr]:\n contains_tilde = col.str.contains(\" ~ \", literal=True)\n loan_str = \" On Loan\"\n loan_col = col.str.ends_with(loan_str)\n split = (\n pl.when(contains_tilde)\n .then(col)\n .otherwise(None)\n .str.split_exact(\" ~ \", 1)\n )\n start = split.struct.field(\"field_0\").cast(pl.UInt16).alias(\"contract_start\")\n end = split.struct.field(\"field_1\").cast(pl.UInt16).alias(\"contract_end\")\n free_agent = (col == \"Free\").alias(\"free_agent\").fill_null(False)\n loan_date = (\n pl.when(loan_col)\n .then(col)\n .otherwise(None)\n .str.split_exact(\" On Loan\", 1)\n .struct.field(\"field_0\")\n .alias(\"loan_date_start\")\n )\n return [start, end, free_agent, parse_date_pl(loan_date)]\n\n\n\n\ndef parse_date_pd(col: pd.Series) -> pd.Series:\n return pd.to_datetime(col, format=\"%b %d, %Y\")\n\ndef parse_suffixed_num_pd(col: pd.Series) -> pd.Series:\n suffix_value = (\n col\n .str[-1]\n .map({\"K\": 1_000, \"M\": 
1_000_000})\n .fillna(1)\n .astype(\"uint32\")\n )\n without_suffix = (\n col\n .str.replace(\"K\", \"\", regex=False)\n .str.replace(\"M\", \"\", regex=False)\n .astype(\"float\")\n )\n return suffix_value * without_suffix\n\ndef parse_money_pd(col: pd.Series) -> pd.Series:\n return parse_suffixed_num_pd(col.str[1:]).astype(\"uint32\")\n\ndef parse_star_pd(col: pd.Series) -> pd.Series:\n return col.str[0].astype(\"uint8\")\n\ndef feet_to_cm_pd(col: pd.Series) -> pd.Series:\n feet_inches_split = col.str.split(\"'\", expand=True)\n total_inches = (\n feet_inches_split[0].astype(\"uint8\").mul(12)\n + feet_inches_split[1].str[:-1].astype(\"uint8\")\n )\n return total_inches.mul(2.54).round().astype(\"uint8\")\n\ndef parse_height_pd(col: pd.Series) -> pd.Series:\n is_cm = col.str.endswith(\"cm\")\n cm_values = col.loc[is_cm].str[:-2].astype(\"uint8\")\n inches_as_cm = feet_to_cm_pd(col.loc[~is_cm])\n return pd.concat([cm_values, inches_as_cm])\n\ndef parse_weight_pd(col: pd.Series) -> pd.Series:\n is_kg = col.str.endswith(\"kg\")\n without_unit = col.where(is_kg, col.str[:-3]).mask(is_kg, col.str[:-2]).astype(\"uint8\")\n return without_unit.where(is_kg, without_unit.mul(0.453592).round().astype(\"uint8\"))\n\ndef parse_contract_pd(df: pd.DataFrame) -> pd.DataFrame:\n contract_col = df[\"Contract\"]\n contains_tilde = contract_col.str.contains(\" ~ \", regex=False)\n split = (\n contract_col.loc[contains_tilde].str.split(\" ~ \", expand=True).astype(pd.UInt16Dtype())\n )\n split.columns = [\"contract_start\", \"contract_end\"]\n not_tilde = contract_col.loc[~contains_tilde]\n free_agent = (contract_col == \"Free\").rename(\"free_agent\").fillna(False)\n loan_date = parse_date_pd(not_tilde.loc[~free_agent].str[:-8]).rename(\"loan_date_start\")\n return pd.concat([df.drop(\"Contract\", axis=1), split, free_agent, loan_date], axis=1)\n\n\n\n\n\n\n3.2.3 Performance comparison\nIn this example, Polars is ~150x faster than Pandas:\n\nPolarsPandas\n\n\n\n%%time\nnew_cols_pl = ([\n pl.col(\"Club\").str.strip_chars().cast(pl.Categorical),\n parse_suffixed_num_pl(pl.col(\"Hits\")).cast(pl.UInt32),\n pl.col(\"Positions\").str.split(\",\"),\n parse_height_pl(pl.col(\"Height\")),\n parse_weight_pl(pl.col(\"Weight\")),\n]\n+ [parse_date_pl(pl.col(col)) for col in date_cols]\n+ [parse_money_pl(pl.col(col)) for col in money_cols]\n+ [parse_star_pl(pl.col(col)) for col in star_cols]\n+ parse_contract_pl(pl.col(\"Contract\"))\n+ [pl.col(col).cast(pl.UInt16) for col in u16_cols]\n+ [pl.col(col).cast(pl.UInt8) for col in u8_cols]\n)\nfifa_pl = (\n pl.scan_csv(\"../data/fifa21_raw_v2.csv\", dtypes=dtypes_pl)\n .with_columns(new_cols_pl)\n .drop(\"Contract\")\n .rename({\"↓OVA\": \"OVA\"})\n .collect()\n)\n\nCPU times: user 182 ms, sys: 26.7 ms, total: 209 ms\nWall time: 52 ms\n\n\n\n\n\n%%time\nfifa_pd = (\n pd.read_csv(\"../data/fifa21_raw_big.csv\", dtype=dtypes_pd)\n .assign(Club=lambda df: df[\"Club\"].cat.rename_categories(lambda c: c.strip()),\n **{col: lambda df: parse_date_pd(df[col]) for col in date_cols},\n **{col: lambda df: parse_money_pd(df[col]) for col in money_cols},\n **{col: lambda df: parse_star_pd(df[col]) for col in star_cols},\n Hits=lambda df: parse_suffixed_num_pd(df[\"Hits\"]).astype(pd.UInt32Dtype()),\n Positions=lambda df: df[\"Positions\"].str.split(\",\"),\n Height=lambda df: parse_height_pd(df[\"Height\"]),\n Weight=lambda df: parse_weight_pd(df[\"Weight\"])\n )\n .pipe(parse_contract_pd)\n .rename(columns={\"↓OVA\": \"OVA\"})\n)\n\nCPU times: user 7.14 s, sys: 401 ms, total: 7.54 
s\nWall time: 7.57 s\n\n\n\n\n\nOutput:\n\nPolarsPandas\n\n\n\nfifa_pl.head()\n\n\n\nshape: (5, 80)IDNameLongNamephotoUrlplayerUrlNationalityAgeOVAPOTClubPositionsHeightWeightPreferred FootBOVBest PositionJoinedLoan Date EndValueWageRelease ClauseAttackingCrossingFinishingHeading AccuracyShort PassingVolleysSkillDribblingCurveFK AccuracyLong PassingBall ControlMovementAccelerationSprint SpeedAgility…StrengthLong ShotsMentalityAggressionInterceptionsPositioningVisionPenaltiesComposureDefendingMarkingStanding TackleSliding TackleGoalkeepingGK DivingGK HandlingGK KickingGK PositioningGK ReflexesTotal StatsBase StatsW/FSMA/WD/WIRPACSHOPASDRIDEFPHYHitscontract_startcontract_endfree_agentloan_date_startu32strstrstrstrcatu8u8u8catlist[str]u8u8catu8catdatedateu32u32u32u16u8u8u8u8u8u16u8u8u8u8u8u16u8u8u8…u8u8u16u8u8u8u8u8u8u16u8u8u8u16u8u8u8u8u8u16u16u8u8catcatu8u8u8u8u8u8u8u32u16u16booldate158023\"L. Messi\"\"Lionel Messi\"\"https://cdn.so…\"http://sofifa.…\"Argentina\"339393\"FC Barcelona\"[\"RW\", \" ST\", \" CF\"]17072\"Left\"93\"RW\"2004-07-01null10350000056000013839999342985957091884709693949196451918091…6994347444093957596913235245461115148223146644\"Medium\"\"Low\"585929195386577120042021falsenull20801\"Cristiano Rona…\"C. Ronaldo dos…\"https://cdn.so…\"http://sofifa.…\"Portugal\"359292\"Juventus\"[\"ST\", \" LW\"]18783\"Right\"92\"ST\"2018-07-10null630000002200007590000143784959082864148881767792431879187…78933536329958284958428322458711151411222146445\"High\"\"Low\"589938189357756220182022falsenull200389\"J. Oblak\"\"Jan Oblak\"\"https://cdn.so…\"http://sofifa.…\"Slovenia\"279193\"Atlético Madri…[\"GK\"]18887\"Right\"91\"GK\"2014-07-16null1200000001250001593999939513111543131091213144030307436067…7812140341911651168572712184378792789090141348931\"Medium\"\"Medium\"387927890529015020142023falsenull192985\"K. De Bruyne\"\"Kevin De Bruyn…\"https://cdn.so…\"http://sofifa.…\"Belgium\"299191\"Manchester Cit…[\"CAM\", \" CM\"]18170\"Right\"91\"CAM\"2015-08-30null12900000037000016100000040794825594824418885839392398777678…749140876668894849118668655356151351013230448554\"High\"\"High\"476869388647820720152023falsenull190871\"Neymar Jr\"\"Neymar da Silv…\"https://cdn.so…\"http://sofifa.…\"Brazil\"289191\"Paris Saint-Ge…[\"LW\", \" CAM\"]17568\"Right\"91\"LW\"2017-08-03null13200000027000016650000040885876287874489588898195453948996…5084356513687909293943530295999151511217545155\"High\"\"Medium\"591858694365959520172022falsenull\n\n\n\n\n\nfifa_pd.head()\n\n\n\n\n\n \n \n \n ID\n Name\n LongName\n photoUrl\n playerUrl\n Nationality\n Age\n OVA\n POT\n Club\n ...\n SHO\n PAS\n DRI\n DEF\n PHY\n Hits\n contract_start\n contract_end\n free_agent\n loan_date_start\n \n \n \n \n 0\n 158023\n L. Messi\n Lionel Messi\n https://cdn.sofifa.com/players/158/023/21_60.png\n http://sofifa.com/player/158023/lionel-messi/2...\n Argentina\n 33\n 93\n 93\n FC Barcelona\n ...\n 92\n 91\n 95\n 38\n 65\n 771\n 2004\n 2021\n False\n NaT\n \n \n 1\n 20801\n Cristiano Ronaldo\n C. Ronaldo dos Santos Aveiro\n https://cdn.sofifa.com/players/020/801/21_60.png\n http://sofifa.com/player/20801/c-ronaldo-dos-s...\n Portugal\n 35\n 92\n 92\n Juventus\n ...\n 93\n 81\n 89\n 35\n 77\n 562\n 2018\n 2022\n False\n NaT\n \n \n 2\n 200389\n J. Oblak\n Jan Oblak\n https://cdn.sofifa.com/players/200/389/21_60.png\n http://sofifa.com/player/200389/jan-oblak/210006/\n Slovenia\n 27\n 91\n 93\n Atlético Madrid\n ...\n 92\n 78\n 90\n 52\n 90\n 150\n 2014\n 2023\n False\n NaT\n \n \n 3\n 192985\n K. 
De Bruyne\n Kevin De Bruyne\n https://cdn.sofifa.com/players/192/985/21_60.png\n http://sofifa.com/player/192985/kevin-de-bruyn...\n Belgium\n 29\n 91\n 91\n Manchester City\n ...\n 86\n 93\n 88\n 64\n 78\n 207\n 2015\n 2023\n False\n NaT\n \n \n 4\n 190871\n Neymar Jr\n Neymar da Silva Santos Jr.\n https://cdn.sofifa.com/players/190/871/21_60.png\n http://sofifa.com/player/190871/neymar-da-silv...\n Brazil\n 28\n 91\n 91\n Paris Saint-Germain\n ...\n 85\n 86\n 94\n 36\n 59\n 595\n 2017\n 2022\n False\n NaT\n \n \n\n5 rows × 80 columns\n\n\n\n\n\n\nYou could play around with the timings here and even try the .profile method to see what Polars spends its time on. In this scenario the speed advantage of Polars likely comes down to three things:\n\nIt is much faster at reading CSVs.\nIt is much faster at processing strings.\nIt can select/assign columns in parallel."
+ "text": "3.2 Polars is faster at the boring stuff\nHere we’ll clean up a messy dataset, kindly provided by Kaggle user Rachit Toshniwal as a deliberate example of a really crap CSV. Most of the cleanup involves extracting numeric data from awkward strings.\nAlso, the data is too small so I’ve concatenated it to itself 20 times. We’re not doing anything that will care about the duplication. Here’s how the raw table looks:\n\n\nCode\nimport pandas as pd\npd.read_csv(\"../data/fifa21_raw_big.csv\", dtype=\"string\", nrows=2)\n\n\n\n\n\n\n \n \n \n ID\n Name\n LongName\n photoUrl\n playerUrl\n Nationality\n Age\n ↓OVA\n POT\n Club\n ...\n A/W\n D/W\n IR\n PAC\n SHO\n PAS\n DRI\n DEF\n PHY\n Hits\n \n \n \n \n 0\n 158023\n L. Messi\n Lionel Messi\n https://cdn.sofifa.com/players/158/023/21_60.png\n http://sofifa.com/player/158023/lionel-messi/2...\n Argentina\n 33\n 93\n 93\n FC Barcelona\n ...\n Medium\n Low\n 5 ★\n 85\n 92\n 91\n 95\n 38\n 65\n 771\n \n \n 1\n 20801\n Cristiano Ronaldo\n C. Ronaldo dos Santos Aveiro\n https://cdn.sofifa.com/players/020/801/21_60.png\n http://sofifa.com/player/20801/c-ronaldo-dos-s...\n Portugal\n 35\n 92\n 92\n Juventus\n ...\n High\n Low\n 5 ★\n 89\n 93\n 81\n 89\n 35\n 77\n 562\n \n \n\n2 rows × 77 columns\n\n\n\nFor this exercise we’ll assume we want to make use of all the columns. First some boilerplate where we map out the different data types:\n\n\nCode\nimport pandas as pd\nimport polars as pl\nimport numpy as np\nimport math\nstr_cols = [\n \"Name\",\n \"LongName\",\n \"playerUrl\",\n \"photoUrl\",\n]\ninitial_category_cols_pl = [\n \"Nationality\",\n \"Preferred Foot\",\n \"Best Position\",\n \"A/W\",\n \"D/W\"\n]\ncategory_cols = [*initial_category_cols_pl, \"Club\"]\ndate_cols = [\n \"Joined\",\n \"Loan Date End\"\n]\n# these all start with the euro symbol and end with 0, M or K\nmoney_cols = [\n \"Value\",\n \"Wage\",\n \"Release Clause\"\n]\nstar_cols = [\n \"W/F\",\n \"SM\",\n \"IR\",\n]\n# Contract col is a range of years\n# Positions is a list of positions\n# Height is in cm\n# Weight is in kg\n# Hits is numbers with K and M \nmessy_cols = [\n \"Contract\",\n \"Positions\",\n \"Height\",\n \"Weight\",\n \"Hits\"\n]\ninitially_str_cols = str_cols + date_cols + money_cols + star_cols + messy_cols\ninitially_str_cols_pl = [*initially_str_cols, \"Club\"]\nu32_cols = [\n \"ID\",\n \"Total Stats\"\n]\nu8_cols = [\n 'Age',\n '↓OVA',\n 'POT',\n 'BOV',\n 'Crossing',\n 'Finishing',\n 'Heading Accuracy',\n 'Short Passing',\n 'Volleys',\n 'Dribbling',\n 'Curve',\n 'FK Accuracy',\n 'Long Passing',\n 'Ball Control',\n 'Acceleration',\n 'Sprint Speed',\n 'Agility',\n 'Reactions',\n 'Balance',\n 'Shot Power',\n 'Jumping',\n 'Stamina',\n 'Strength',\n 'Long Shots',\n 'Aggression',\n 'Interceptions',\n 'Positioning',\n 'Vision',\n 'Penalties',\n 'Composure',\n 'Marking',\n 'Standing Tackle',\n 'Sliding Tackle',\n 'GK Diving',\n 'GK Handling',\n 'GK Kicking',\n 'GK Positioning',\n 'GK Reflexes',\n 'PAC',\n 'SHO',\n 'PAS',\n 'DRI',\n 'DEF',\n 'PHY'\n]\n\nu16_cols = [\n 'Attacking',\n 'Skill',\n 'Movement',\n 'Power',\n 'Mentality',\n 'Defending',\n 'Goalkeeping',\n 'Total Stats',\n 'Base Stats'\n]\n\n\n\n3.2.1 Dtypes\nHere are the initial dtypes for the two dataframes:\n\nPolarsPandas\n\n\n\n# can't use UInt8/16 in scan_csv\ndtypes_pl = (\n {col: pl.Utf8 for col in initially_str_cols_pl}\n | {col: pl.Categorical for col in initial_category_cols_pl}\n | {col: pl.UInt32 for col in [*u32_cols, *u16_cols, *u8_cols]}\n)\n\n\n\n\ndtypes_pd = (\n {col: 
pd.StringDtype() for col in initially_str_cols}\n | {col: pd.CategoricalDtype() for col in category_cols}\n | {col: \"uint32\" for col in u32_cols}\n | {col: \"uint8\" for col in u8_cols}\n | {col: \"uint16\" for col in u16_cols}\n)\n\n\n\n\nOne thing I’ll note here is that Pandas numeric types are somewhat confusing: \"uint32\" means np.uint32 which is not the same thing as pd.UInt32Dtype(). Only the latter is nullable. On the other hand, Polars has just one unsigned 32-bit integer type, and it’s nullable.\n\n\n\n\n\n\nTip\n\n\n\nPolars expressions have a shrink_dtype method that can be more convenient than manually specifying the dtypes yourself. It’s not magic though, and it has to spend time finding the min and max of the column.\n\n\n\n\n3.2.2 Data cleaning\nThere’s not much that you haven’t seen here already, so we won’t explain the code line by line. The main new thing here is pl.when for ternary expressions.\n\nPolarsPandas\n\n\n\ndef parse_date_pl(col: pl.Expr) -> pl.Expr:\n return col.str.strptime(pl.Date, format=\"%b %d, %Y\")\n\ndef parse_suffixed_num_pl(col: pl.Expr) -> pl.Expr:\n suffix = col.str.slice(-1, 1)\n suffix_value = (\n pl.when(suffix == \"K\")\n .then(1_000)\n .when(suffix == \"M\")\n .then(1_000_000)\n .otherwise(1)\n .cast(pl.UInt32)\n )\n without_suffix = (\n col\n .str.replace(\"K\", \"\", literal=True)\n .str.replace(\"M\", \"\", literal=True)\n .cast(pl.Float32)\n )\n original_name = col.meta.output_name()\n return (suffix_value * without_suffix).alias(original_name)\n\ndef parse_money_pl(col: pl.Expr) -> pl.Expr:\n return parse_suffixed_num_pl(col.str.slice(1)).cast(pl.UInt32)\n\ndef parse_star_pl(col: pl.Expr) -> pl.Expr:\n return col.str.slice(0, 1).cast(pl.UInt8)\n\ndef feet_to_cm_pl(col: pl.Expr) -> pl.Expr:\n feet_inches_split = col.str.split_exact(\"'\", 1)\n total_inches = (\n (feet_inches_split.struct.field(\"field_0\").cast(pl.UInt8, strict=False) * 12)\n + feet_inches_split.struct.field(\"field_1\").str.strip_chars_end('\"').cast(pl.UInt8, strict=False)\n )\n return (total_inches * 2.54).round(0).cast(pl.UInt8)\n\ndef parse_height_pl(col: pl.Expr) -> pl.Expr:\n is_cm = col.str.ends_with(\"cm\")\n return (\n pl.when(is_cm)\n .then(col.str.slice(0, 3).cast(pl.UInt8, strict=False))\n .otherwise(feet_to_cm_pl(col))\n )\n\ndef parse_weight_pl(col: pl.Expr) -> pl.Expr:\n is_kg = col.str.ends_with(\"kg\")\n without_unit = col.str.extract(r\"(\\d+)\").cast(pl.UInt8)\n return (\n pl.when(is_kg)\n .then(without_unit)\n .otherwise((without_unit * 0.453592).round(0).cast(pl.UInt8))\n )\n\ndef parse_contract_pl(col: pl.Expr) -> list[pl.Expr]:\n contains_tilde = col.str.contains(\" ~ \", literal=True)\n loan_str = \" On Loan\"\n loan_col = col.str.ends_with(loan_str)\n split = (\n pl.when(contains_tilde)\n .then(col)\n .otherwise(None)\n .str.split_exact(\" ~ \", 1)\n )\n start = split.struct.field(\"field_0\").cast(pl.UInt16).alias(\"contract_start\")\n end = split.struct.field(\"field_1\").cast(pl.UInt16).alias(\"contract_end\")\n free_agent = (col == \"Free\").alias(\"free_agent\").fill_null(False)\n loan_date = (\n pl.when(loan_col)\n .then(col)\n .otherwise(None)\n .str.split_exact(\" On Loan\", 1)\n .struct.field(\"field_0\")\n .alias(\"loan_date_start\")\n )\n return [start, end, free_agent, parse_date_pl(loan_date)]\n\n\n\n\ndef parse_date_pd(col: pd.Series) -> pd.Series:\n return pd.to_datetime(col, format=\"%b %d, %Y\")\n\ndef parse_suffixed_num_pd(col: pd.Series) -> pd.Series:\n suffix_value = (\n col\n .str[-1]\n .map({\"K\": 1_000, \"M\": 
1_000_000})\n .fillna(1)\n .astype(\"uint32\")\n )\n without_suffix = (\n col\n .str.replace(\"K\", \"\", regex=False)\n .str.replace(\"M\", \"\", regex=False)\n .astype(\"float\")\n )\n return suffix_value * without_suffix\n\ndef parse_money_pd(col: pd.Series) -> pd.Series:\n return parse_suffixed_num_pd(col.str[1:]).astype(\"uint32\")\n\ndef parse_star_pd(col: pd.Series) -> pd.Series:\n return col.str[0].astype(\"uint8\")\n\ndef feet_to_cm_pd(col: pd.Series) -> pd.Series:\n feet_inches_split = col.str.split(\"'\", expand=True)\n total_inches = (\n feet_inches_split[0].astype(\"uint8\").mul(12)\n + feet_inches_split[1].str[:-1].astype(\"uint8\")\n )\n return total_inches.mul(2.54).round().astype(\"uint8\")\n\ndef parse_height_pd(col: pd.Series) -> pd.Series:\n is_cm = col.str.endswith(\"cm\")\n cm_values = col.loc[is_cm].str[:-2].astype(\"uint8\")\n inches_as_cm = feet_to_cm_pd(col.loc[~is_cm])\n return pd.concat([cm_values, inches_as_cm])\n\ndef parse_weight_pd(col: pd.Series) -> pd.Series:\n is_kg = col.str.endswith(\"kg\")\n without_unit = col.where(is_kg, col.str[:-3]).mask(is_kg, col.str[:-2]).astype(\"uint8\")\n return without_unit.where(is_kg, without_unit.mul(0.453592).round().astype(\"uint8\"))\n\ndef parse_contract_pd(df: pd.DataFrame) -> pd.DataFrame:\n contract_col = df[\"Contract\"]\n contains_tilde = contract_col.str.contains(\" ~ \", regex=False)\n split = (\n contract_col.loc[contains_tilde].str.split(\" ~ \", expand=True).astype(pd.UInt16Dtype())\n )\n split.columns = [\"contract_start\", \"contract_end\"]\n not_tilde = contract_col.loc[~contains_tilde]\n free_agent = (contract_col == \"Free\").rename(\"free_agent\").fillna(False)\n loan_date = parse_date_pd(not_tilde.loc[~free_agent].str[:-8]).rename(\"loan_date_start\")\n return pd.concat([df.drop(\"Contract\", axis=1), split, free_agent, loan_date], axis=1)\n\n\n\n\n\n\n3.2.3 Performance comparison\nIn this example, Polars is ~150x faster than Pandas:\n\nPolarsPandas\n\n\n\n%%time\nnew_cols_pl = ([\n pl.col(\"Club\").str.strip_chars().cast(pl.Categorical),\n parse_suffixed_num_pl(pl.col(\"Hits\")).cast(pl.UInt32),\n pl.col(\"Positions\").str.split(\",\"),\n parse_height_pl(pl.col(\"Height\")),\n parse_weight_pl(pl.col(\"Weight\")),\n]\n+ [parse_date_pl(pl.col(col)) for col in date_cols]\n+ [parse_money_pl(pl.col(col)) for col in money_cols]\n+ [parse_star_pl(pl.col(col)) for col in star_cols]\n+ parse_contract_pl(pl.col(\"Contract\"))\n+ [pl.col(col).cast(pl.UInt16) for col in u16_cols]\n+ [pl.col(col).cast(pl.UInt8) for col in u8_cols]\n)\nfifa_pl = (\n pl.scan_csv(\"../data/fifa21_raw_v2.csv\", dtypes=dtypes_pl)\n .with_columns(new_cols_pl)\n .drop(\"Contract\")\n .rename({\"↓OVA\": \"OVA\"})\n .collect()\n)\n\nCPU times: user 149 ms, sys: 20.8 ms, total: 170 ms\nWall time: 42.8 ms\n\n\n\n\n\n%%time\nfifa_pd = (\n pd.read_csv(\"../data/fifa21_raw_big.csv\", dtype=dtypes_pd)\n .assign(Club=lambda df: df[\"Club\"].cat.rename_categories(lambda c: c.strip()),\n **{col: lambda df: parse_date_pd(df[col]) for col in date_cols},\n **{col: lambda df: parse_money_pd(df[col]) for col in money_cols},\n **{col: lambda df: parse_star_pd(df[col]) for col in star_cols},\n Hits=lambda df: parse_suffixed_num_pd(df[\"Hits\"]).astype(pd.UInt32Dtype()),\n Positions=lambda df: df[\"Positions\"].str.split(\",\"),\n Height=lambda df: parse_height_pd(df[\"Height\"]),\n Weight=lambda df: parse_weight_pd(df[\"Weight\"])\n )\n .pipe(parse_contract_pd)\n .rename(columns={\"↓OVA\": \"OVA\"})\n)\n\nCPU times: user 6.53 s, sys: 336 ms, total: 6.87 
s\nWall time: 6.86 s\n\n\n\n\n\nOutput:\n\nPolarsPandas\n\n\n\nfifa_pl.head()\n\n\n\nshape: (5, 80)IDNameLongNamephotoUrlplayerUrlNationalityAgeOVAPOTClubPositionsHeightWeightPreferred FootBOVBest PositionJoinedLoan Date EndValueWageRelease ClauseAttackingCrossingFinishingHeading AccuracyShort PassingVolleysSkillDribblingCurveFK AccuracyLong PassingBall ControlMovementAccelerationSprint SpeedAgility…StrengthLong ShotsMentalityAggressionInterceptionsPositioningVisionPenaltiesComposureDefendingMarkingStanding TackleSliding TackleGoalkeepingGK DivingGK HandlingGK KickingGK PositioningGK ReflexesTotal StatsBase StatsW/FSMA/WD/WIRPACSHOPASDRIDEFPHYHitscontract_startcontract_endfree_agentloan_date_startu32strstrstrstrcatu8u8u8catlist[str]u8u8catu8catdatedateu32u32u32u16u8u8u8u8u8u16u8u8u8u8u8u16u8u8u8…u8u8u16u8u8u8u8u8u8u16u8u8u8u16u8u8u8u8u8u16u16u8u8catcatu8u8u8u8u8u8u8u32u16u16booldate158023\"L. Messi\"\"Lionel Messi\"\"https://cdn.so…\"http://sofifa.…\"Argentina\"339393\"FC Barcelona\"[\"RW\", \" ST\", \" CF\"]17072\"Left\"93\"RW\"2004-07-01null10350000056000013839999342985957091884709693949196451918091…6994347444093957596913235245461115148223146644\"Medium\"\"Low\"585929195386577120042021falsenull20801\"Cristiano Rona…\"C. Ronaldo dos…\"https://cdn.so…\"http://sofifa.…\"Portugal\"359292\"Juventus\"[\"ST\", \" LW\"]18783\"Right\"92\"ST\"2018-07-10null630000002200007590000143784959082864148881767792431879187…78933536329958284958428322458711151411222146445\"High\"\"Low\"589938189357756220182022falsenull200389\"J. Oblak\"\"Jan Oblak\"\"https://cdn.so…\"http://sofifa.…\"Slovenia\"279193\"Atlético Madri…[\"GK\"]18887\"Right\"91\"GK\"2014-07-16null1200000001250001593999939513111543131091213144030307436067…7812140341911651168572712184378792789090141348931\"Medium\"\"Medium\"387927890529015020142023falsenull192985\"K. De Bruyne\"\"Kevin De Bruyn…\"https://cdn.so…\"http://sofifa.…\"Belgium\"299191\"Manchester Cit…[\"CAM\", \" CM\"]18170\"Right\"91\"CAM\"2015-08-30null12900000037000016100000040794825594824418885839392398777678…749140876668894849118668655356151351013230448554\"High\"\"High\"476869388647820720152023falsenull190871\"Neymar Jr\"\"Neymar da Silv…\"https://cdn.so…\"http://sofifa.…\"Brazil\"289191\"Paris Saint-Ge…[\"LW\", \" CAM\"]17568\"Right\"91\"LW\"2017-08-03null13200000027000016650000040885876287874489588898195453948996…5084356513687909293943530295999151511217545155\"High\"\"Medium\"591858694365959520172022falsenull\n\n\n\n\n\nfifa_pd.head()\n\n\n\n\n\n \n \n \n ID\n Name\n LongName\n photoUrl\n playerUrl\n Nationality\n Age\n OVA\n POT\n Club\n ...\n SHO\n PAS\n DRI\n DEF\n PHY\n Hits\n contract_start\n contract_end\n free_agent\n loan_date_start\n \n \n \n \n 0\n 158023\n L. Messi\n Lionel Messi\n https://cdn.sofifa.com/players/158/023/21_60.png\n http://sofifa.com/player/158023/lionel-messi/2...\n Argentina\n 33\n 93\n 93\n FC Barcelona\n ...\n 92\n 91\n 95\n 38\n 65\n 771\n 2004\n 2021\n False\n NaT\n \n \n 1\n 20801\n Cristiano Ronaldo\n C. Ronaldo dos Santos Aveiro\n https://cdn.sofifa.com/players/020/801/21_60.png\n http://sofifa.com/player/20801/c-ronaldo-dos-s...\n Portugal\n 35\n 92\n 92\n Juventus\n ...\n 93\n 81\n 89\n 35\n 77\n 562\n 2018\n 2022\n False\n NaT\n \n \n 2\n 200389\n J. Oblak\n Jan Oblak\n https://cdn.sofifa.com/players/200/389/21_60.png\n http://sofifa.com/player/200389/jan-oblak/210006/\n Slovenia\n 27\n 91\n 93\n Atlético Madrid\n ...\n 92\n 78\n 90\n 52\n 90\n 150\n 2014\n 2023\n False\n NaT\n \n \n 3\n 192985\n K. 
De Bruyne\n Kevin De Bruyne\n https://cdn.sofifa.com/players/192/985/21_60.png\n http://sofifa.com/player/192985/kevin-de-bruyn...\n Belgium\n 29\n 91\n 91\n Manchester City\n ...\n 86\n 93\n 88\n 64\n 78\n 207\n 2015\n 2023\n False\n NaT\n \n \n 4\n 190871\n Neymar Jr\n Neymar da Silva Santos Jr.\n https://cdn.sofifa.com/players/190/871/21_60.png\n http://sofifa.com/player/190871/neymar-da-silv...\n Brazil\n 28\n 91\n 91\n Paris Saint-Germain\n ...\n 85\n 86\n 94\n 36\n 59\n 595\n 2017\n 2022\n False\n NaT\n \n \n\n5 rows × 80 columns\n\n\n\n\n\n\nYou could play around with the timings here and even try the .profile method to see what Polars spends its time on. In this scenario the speed advantage of Polars likely comes down to three things:\n\nIt is much faster at reading CSVs.\nIt is much faster at processing strings.\nIt can select/assign columns in parallel."
},
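The entry above points at the `.profile` method as a way to see where Polars spends its time. A minimal sketch of how that might look, reusing the CSV path from the example; treat it as illustrative rather than a measured benchmark:

```python
import polars as pl

# .profile() runs the lazy query and returns (result, timings):
# one timing row per node of the plan, with start/end times in microseconds.
result, timings = (
    pl.scan_csv("../data/fifa21_raw_v2.csv")
    .with_columns(pl.col("Club").str.strip_chars().cast(pl.Categorical))
    .profile()
)
print(timings)  # e.g. rows for the csv scan and the with_columns node
```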
{
"objectID": "performance.html#numpy-can-make-polars-faster",
"href": "performance.html#numpy-can-make-polars-faster",
"title": "3 Performance",
"section": "3.3 NumPy can make Polars faster",
- "text": "3.3 NumPy can make Polars faster\nPolars gets along well with NumPy ufuncs, even in lazy mode (which is interesting because NumPy has no lazy API). Let’s see how this looks by calculating the great-circle distance between a bunch of coordinates.\n\n3.3.1 Get the data\nWe create a lazy dataframe containing pairs of airports and their coordinates:\n\nairports = pl.scan_csv(\"../data/airports.csv\").drop_nulls().unique(subset=[\"AIRPORT\"])\npairs = airports.join(airports, on=\"AIRPORT\", how=\"cross\").filter(\n (pl.col(\"AIRPORT\") != pl.col(\"AIRPORT_right\"))\n & (pl.col(\"LATITUDE\") != pl.col(\"LATITUDE_right\"))\n & (pl.col(\"LONGITUDE\") != pl.col(\"LONGITUDE_right\"))\n)\n\n\n\n3.3.2 Calculate great-circle distance\nOne use case for NumPy ufuncs is doing computations that Polars expressions don’t support. In this example Polars can do everything we need, though the ufunc version ends up being slightly faster:\n\nPolarsNumPy\n\n\n\ndef deg2rad_pl(degrees: pl.Expr) -> pl.Expr:\n return degrees * math.pi / 180\n\ndef gcd_pl(lat1: pl.Expr, lng1: pl.Expr, lat2: pl.Expr, lng2: pl.Expr):\n ϕ1 = deg2rad_pl(90 - lat1)\n ϕ2 = deg2rad_pl(90 - lat2)\n\n θ1 = deg2rad_pl(lng1)\n θ2 = deg2rad_pl(lng2)\n\n cos = ϕ1.sin() * ϕ2.sin() * (θ1 - θ2).cos() + ϕ1.cos() * ϕ2.cos()\n arc = cos.arccos()\n return arc * 6373\n\n\n\n\ndef gcd_np(lat1, lng1, lat2, lng2):\n ϕ1 = np.deg2rad(90 - lat1)\n ϕ2 = np.deg2rad(90 - lat2)\n\n θ1 = np.deg2rad(lng1)\n θ2 = np.deg2rad(lng2)\n\n cos = np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) + np.cos(ϕ1) * np.cos(ϕ2)\n arc = np.arccos(cos)\n return arc * 6373\n\n\n\n\nWe can pass Polars expressions directly to our gcd_np function, which is pretty nice since these things don’t even store the data themselves:\n\n%%timeit\npairs.select(\n gcd_np(\n pl.col(\"LATITUDE\"),\n pl.col(\"LONGITUDE\"),\n pl.col(\"LATITUDE_right\"),\n pl.col(\"LONGITUDE_right\")\n )\n).collect()\n\n4 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\nOn my machine the NumPy version tends to be 5-20% faster than the pure Polars version:\n\n%%timeit\npairs.select(\n gcd_pl(\n pl.col(\"LATITUDE\"),\n pl.col(\"LONGITUDE\"),\n pl.col(\"LATITUDE_right\"),\n pl.col(\"LONGITUDE_right\")\n )\n).collect()\n\n5.19 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\nThis may not be a huge performance difference, but it at least means you don’t sacrifice speed when relying on NumPy. There are some gotchas though so watch out for those.\nAlso watch out for .to_numpy() - you don’t always need to call this and it can slow things down:\n\n%%timeit\ncollected = pairs.collect()\ngcd_np(\n collected[\"LATITUDE\"].to_numpy(),\n collected[\"LONGITUDE\"].to_numpy(),\n collected[\"LATITUDE_right\"].to_numpy(),\n collected[\"LONGITUDE_right\"].to_numpy()\n)\n\n6.23 s ± 79.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)"
+ "text": "3.3 NumPy can make Polars faster\nPolars gets along well with NumPy ufuncs, even in lazy mode (which is interesting because NumPy has no lazy API). Let’s see how this looks by calculating the great-circle distance between a bunch of coordinates.\n\n3.3.1 Get the data\nWe create a lazy dataframe containing pairs of airports and their coordinates:\n\nairports = pl.scan_csv(\"../data/airports.csv\").drop_nulls().unique(subset=[\"AIRPORT\"])\npairs = airports.join(airports, on=\"AIRPORT\", how=\"cross\").filter(\n (pl.col(\"AIRPORT\") != pl.col(\"AIRPORT_right\"))\n & (pl.col(\"LATITUDE\") != pl.col(\"LATITUDE_right\"))\n & (pl.col(\"LONGITUDE\") != pl.col(\"LONGITUDE_right\"))\n)\n\n\n\n3.3.2 Calculate great-circle distance\nOne use case for NumPy ufuncs is doing computations that Polars expressions don’t support. In this example Polars can do everything we need, though the ufunc version ends up being slightly faster:\n\nPolarsNumPy\n\n\n\ndef deg2rad_pl(degrees: pl.Expr) -> pl.Expr:\n return degrees * math.pi / 180\n\ndef gcd_pl(lat1: pl.Expr, lng1: pl.Expr, lat2: pl.Expr, lng2: pl.Expr):\n ϕ1 = deg2rad_pl(90 - lat1)\n ϕ2 = deg2rad_pl(90 - lat2)\n\n θ1 = deg2rad_pl(lng1)\n θ2 = deg2rad_pl(lng2)\n\n cos = ϕ1.sin() * ϕ2.sin() * (θ1 - θ2).cos() + ϕ1.cos() * ϕ2.cos()\n arc = cos.arccos()\n return arc * 6373\n\n\n\n\ndef gcd_np(lat1, lng1, lat2, lng2):\n ϕ1 = np.deg2rad(90 - lat1)\n ϕ2 = np.deg2rad(90 - lat2)\n\n θ1 = np.deg2rad(lng1)\n θ2 = np.deg2rad(lng2)\n\n cos = np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) + np.cos(ϕ1) * np.cos(ϕ2)\n arc = np.arccos(cos)\n return arc * 6373\n\n\n\n\nWe can pass Polars expressions directly to our gcd_np function, which is pretty nice since these things don’t even store the data themselves:\n\n%%timeit\npairs.select(\n gcd_np(\n pl.col(\"LATITUDE\"),\n pl.col(\"LONGITUDE\"),\n pl.col(\"LATITUDE_right\"),\n pl.col(\"LONGITUDE_right\")\n )\n).collect()\n\n3.91 s ± 53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\nOn my machine the NumPy version tends to be 5-20% faster than the pure Polars version:\n\n%%timeit\npairs.select(\n gcd_pl(\n pl.col(\"LATITUDE\"),\n pl.col(\"LONGITUDE\"),\n pl.col(\"LATITUDE_right\"),\n pl.col(\"LONGITUDE_right\")\n )\n).collect()\n\n4.64 s ± 51.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\nThis may not be a huge performance difference, but it at least means you don’t sacrifice speed when relying on NumPy. There are some gotchas though so watch out for those.\nAlso watch out for .to_numpy() - you don’t always need to call this and it can slow things down:\n\n%%timeit\ncollected = pairs.collect()\ngcd_np(\n collected[\"LATITUDE\"].to_numpy(),\n collected[\"LONGITUDE\"].to_numpy(),\n collected[\"LATITUDE_right\"].to_numpy(),\n collected[\"LONGITUDE_right\"].to_numpy()\n)\n\n5.22 s ± 187 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)"
},
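As a small sanity check of the point above — that `gcd_np` happily accepts Polars expressions as well as ordinary NumPy inputs — here is a self-contained sketch. The coordinates are made up and the distance figure is only approximate:

```python
import numpy as np
import polars as pl

def gcd_np(lat1, lng1, lat2, lng2):
    # Same spherical law of cosines as above; every step is a NumPy ufunc,
    # so scalars, NumPy arrays and pl.Expr inputs all work.
    phi1, phi2 = np.deg2rad(90 - lat1), np.deg2rad(90 - lat2)
    theta1, theta2 = np.deg2rad(lng1), np.deg2rad(lng2)
    cos = np.sin(phi1) * np.sin(phi2) * np.cos(theta1 - theta2) + np.cos(phi1) * np.cos(phi2)
    return np.arccos(cos) * 6373

print(gcd_np(51.5, -0.1, 40.7, -74.0))  # scalars: roughly 5,570 km (London ~ New York)

pairs_toy = pl.DataFrame({"lat1": [51.5], "lng1": [-0.1], "lat2": [40.7], "lng2": [-74.0]})
print(pairs_toy.select(gcd_np(pl.col("lat1"), pl.col("lng1"), pl.col("lat2"), pl.col("lng2"))))
```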
{
"objectID": "performance.html#polars-can-be-slower-than-pandas-sometimes-maybe",
"href": "performance.html#polars-can-be-slower-than-pandas-sometimes-maybe",
"title": "3 Performance",
"section": "3.4 Polars can be slower than Pandas sometimes, maybe",
- "text": "3.4 Polars can be slower than Pandas sometimes, maybe\nHere’s an example where we calculate z-scores, using window functions in Polars and using groupby-transform in Pandas:\n\ndef create_frame(n, n_groups):\n return pl.DataFrame(\n {\"name\": np.random.randint(0, n_groups, size=n), \"value2\": np.random.randn(n)}\n )\n\ndef pandas_transform(df: pd.DataFrame) -> pd.DataFrame:\n g = df.groupby(\"name\")[\"value2\"]\n v = df[\"value2\"]\n return (v - g.transform(\"mean\")) / g.transform(\"std\")\n\n\ndef polars_transform() -> pl.Expr:\n v = pl.col(\"value2\")\n return (v - v.mean().over(\"name\")) / v.std().over(\"name\")\n\nrand_df_pl = create_frame(50_000_000, 50_000)\nrand_df_pd = rand_df_pl.to_pandas()\n\nThe Polars version tends to be 10-100% slower on my machine:\n\nPolarsPandas\n\n\n\n%timeit rand_df_pl.select(polars_transform())\n\n3.32 s ± 91.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\n\n\n\n%timeit pandas_transform(rand_df_pd)\n\n2.18 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\n\n\n\nThis example isn’t telling you to use Pandas in this specific situation. Once you add in the time spent reading a file, Polars likely wins.\nAnd even here, if you sort by the name col, Polars wins again. It has fast-track algorithms for sorted data."
+ "text": "3.4 Polars can be slower than Pandas sometimes, maybe\nHere’s an example where we calculate z-scores, using window functions in Polars and using groupby-transform in Pandas:\n\ndef create_frame(n, n_groups):\n return pl.DataFrame(\n {\"name\": np.random.randint(0, n_groups, size=n), \"value2\": np.random.randn(n)}\n )\n\ndef pandas_transform(df: pd.DataFrame) -> pd.DataFrame:\n g = df.groupby(\"name\")[\"value2\"]\n v = df[\"value2\"]\n return (v - g.transform(\"mean\")) / g.transform(\"std\")\n\n\ndef polars_transform() -> pl.Expr:\n v = pl.col(\"value2\")\n return (v - v.mean().over(\"name\")) / v.std().over(\"name\")\n\nrand_df_pl = create_frame(50_000_000, 50_000)\nrand_df_pd = rand_df_pl.to_pandas()\n\nThe Polars version tends to be 10-100% slower on my machine:\n\nPolarsPandas\n\n\n\n%timeit rand_df_pl.select(polars_transform())\n\n3.08 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\n\n\n\n%timeit pandas_transform(rand_df_pd)\n\n2.18 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n\n\n\n\nThis example isn’t telling you to use Pandas in this specific situation. Once you add in the time spent reading a file, Polars likely wins.\nAnd even here, if you sort by the name col, Polars wins again. It has fast-track algorithms for sorted data."
},
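The closing claim above — that sorting by the name column flips the result because Polars has fast-track algorithms for sorted data — is easy to test yourself. A hedged sketch that reuses create_frame and polars_transform from the example; timings will differ by machine:

```python
# Assumes create_frame and polars_transform from the example above are in scope.
rand_df_pl = create_frame(50_000_000, 50_000)

# Paying for one sort up front lets every .over("name") window hit the
# sorted fast path afterwards.
rand_df_sorted = rand_df_pl.sort("name")
rand_df_sorted.select(polars_transform())

# If the data already arrives sorted, DataFrame.set_sorted("name") marks the
# column as sorted instead of re-sorting it.
```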
{
"objectID": "performance.html#summary",
@@ -172,7 +172,7 @@
"href": "tidy.html#cleaning",
"title": "4 Reshaping and Tidy Data",
"section": "4.2 Cleaning 🧹",
- "text": "4.2 Cleaning 🧹\nNothing super interesting here:\n\nPolarsPandas\n\n\n\ngames_pl = (\n pl.scan_csv(nba_glob)\n .with_columns(\n pl.col(\"date\").str.strptime(pl.Date, \"%a, %b %d, %Y\"),\n )\n .sort(\"date\")\n .with_row_count(\"game_id\")\n)\ngames_pl.head().collect()\n\n/tmp/ipykernel_14427/3011805186.py:7: DeprecationWarning:\n\n`with_row_count` is deprecated. Use `with_row_index` instead. Note that the default column name has changed from 'row_nr' to 'index'.\n\n\n\n\n\nshape: (5, 6)game_iddateaway_teamaway_pointshome_teamhome_pointsu32datestri64stri6402015-10-27\"Cleveland Cava…95\"Chicago Bulls\"9712015-10-27\"Detroit Piston…106\"Atlanta Hawks\"9422015-10-27\"New Orleans Pe…95\"Golden State W…11132015-10-28\"Washington Wiz…88\"Orlando Magic\"8742015-10-28\"Philadelphia 7…95\"Boston Celtics…112\n\n\n\n\n\ngames_pd = (\n pl.read_csv(nba_glob)\n .to_pandas()\n .dropna(how=\"all\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"], format=\"%a, %b %d, %Y\"))\n .sort_values(\"date\")\n .reset_index(drop=True)\n .set_index(\"date\", append=True)\n .rename_axis([\"game_id\", \"date\"])\n .sort_index()\n)\ngames_pd.head()\n\n\n\n\n\n \n \n \n \n away_team\n away_points\n home_team\n home_points\n \n \n game_id\n date\n \n \n \n \n \n \n \n \n 0\n 2015-10-27\n Cleveland Cavaliers\n 95\n Chicago Bulls\n 97\n \n \n 1\n 2015-10-27\n Detroit Pistons\n 106\n Atlanta Hawks\n 94\n \n \n 2\n 2015-10-27\n New Orleans Pelicans\n 95\n Golden State Warriors\n 111\n \n \n 3\n 2015-10-28\n Philadelphia 76ers\n 95\n Boston Celtics\n 112\n \n \n 4\n 2015-10-28\n Washington Wizards\n 88\n Orlando Magic\n 87\n \n \n\n\n\n\n\n\n\nPolars does have a drop_nulls method but the only parameter it takes is subset, which — like in Pandas — lets you consider null values just for a subset of the columns. Pandas additionally lets you specify how=\"all\" to drop a row only if every value is null, but Polars drop_nulls has no such parameter and will drop the row if any values are null. If you only want to drop when all values are null, the docs recommend .filter(~pl.all(pl.all().is_null())).\n\n\n\n\n\n\nNote\n\n\n\nA previous version of the Polars example used pl.fold, which is for fast horizontal operations. It doesn’t come up anywhere else in this book, so consider this your warning that it exists."
+ "text": "4.2 Cleaning 🧹\nNothing super interesting here:\n\nPolarsPandas\n\n\n\ngames_pl = (\n pl.scan_csv(nba_glob)\n .with_columns(\n pl.col(\"date\").str.strptime(pl.Date, \"%a, %b %d, %Y\"),\n )\n .sort(\"date\")\n .with_row_index(\"game_id\")\n)\ngames_pl.head().collect()\n\n\n\nshape: (5, 6)game_iddateaway_teamaway_pointshome_teamhome_pointsu32datestri64stri6402015-10-27\"Cleveland Cava…95\"Chicago Bulls\"9712015-10-27\"Detroit Piston…106\"Atlanta Hawks\"9422015-10-27\"New Orleans Pe…95\"Golden State W…11132015-10-28\"Washington Wiz…88\"Orlando Magic\"8742015-10-28\"Philadelphia 7…95\"Boston Celtics…112\n\n\n\n\n\ngames_pd = (\n pl.read_csv(nba_glob)\n .to_pandas()\n .dropna(how=\"all\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"], format=\"%a, %b %d, %Y\"))\n .sort_values(\"date\")\n .reset_index(drop=True)\n .set_index(\"date\", append=True)\n .rename_axis([\"game_id\", \"date\"])\n .sort_index()\n)\ngames_pd.head()\n\n\n\n\n\n \n \n \n \n away_team\n away_points\n home_team\n home_points\n \n \n game_id\n date\n \n \n \n \n \n \n \n \n 0\n 2015-10-27\n Cleveland Cavaliers\n 95\n Chicago Bulls\n 97\n \n \n 1\n 2015-10-27\n Detroit Pistons\n 106\n Atlanta Hawks\n 94\n \n \n 2\n 2015-10-27\n New Orleans Pelicans\n 95\n Golden State Warriors\n 111\n \n \n 3\n 2015-10-28\n Philadelphia 76ers\n 95\n Boston Celtics\n 112\n \n \n 4\n 2015-10-28\n Washington Wizards\n 88\n Orlando Magic\n 87\n \n \n\n\n\n\n\n\n\nPolars does have a drop_nulls method but the only parameter it takes is subset, which — like in Pandas — lets you consider null values just for a subset of the columns. Pandas additionally lets you specify how=\"all\" to drop a row only if every value is null, but Polars drop_nulls has no such parameter and will drop the row if any values are null. If you only want to drop when all values are null, the docs recommend .filter(~pl.all(pl.all().is_null())).\n\n\n\n\n\n\nNote\n\n\n\nA previous version of the Polars example used pl.fold, which is for fast horizontal operations. It doesn’t come up anywhere else in this book, so consider this your warning that it exists."
},
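To make the drop-only-if-all-null workaround quoted above concrete, here is a tiny sketch; `pl.all_horizontal` is the horizontal AND reduction (the snippet from the docs spells the same thing with the older `pl.all(...)` form):

```python
import polars as pl

df = pl.DataFrame({"a": [1, None, None], "b": [2, None, 5]})

# Keep a row unless *every* value in it is null - i.e. Pandas' dropna(how="all").
print(df.filter(~pl.all_horizontal(pl.all().is_null())))
# The middle row (all nulls) is dropped; rows with only some nulls survive.
```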
{
"objectID": "tidy.html#pivot-and-melt",
@@ -186,7 +186,7 @@
"href": "tidy.html#tidy-nba-data",
"title": "4 Reshaping and Tidy Data",
"section": "4.4 Tidy NBA data",
- "text": "4.4 Tidy NBA data\nSuppose we want to calculate the days of rest each team had before each game. In the current structure this is difficult because we need to track both the home_team and away_team columns. We’ll use .melt so that there’s a single team column. This makes it easier to add a rest column with the per-team rest days between games.\n\nPolarsPandas\n\n\n\ntidy_pl = (\n games_pl\n .melt(\n id_vars=[\"game_id\", \"date\"],\n value_vars=[\"away_team\", \"home_team\"],\n value_name=\"team\",\n )\n .sort(\"game_id\")\n .with_columns((\n pl.col(\"date\")\n .alias(\"rest\")\n .diff().over(\"team\")\n .dt.total_days() - 1).cast(pl.Int8))\n .drop_nulls(\"rest\")\n .collect()\n)\ntidy_pl\n\n\n\nshape: (2_602, 5)game_iddatevariableteamrestu32datestrstri852015-10-28\"away_team\"\"Chicago Bulls\"062015-10-28\"home_team\"\"Detroit Piston…0112015-10-28\"away_team\"\"Cleveland Cava…0……………13152016-06-19\"away_team\"\"Cleveland Cava…213152016-06-19\"home_team\"\"Golden State W…2\n\n\n\n\n\ntidy_pd = (\n games_pd.reset_index()\n .melt(\n id_vars=[\"game_id\", \"date\"],\n value_vars=[\"away_team\", \"home_team\"],\n value_name=\"team\",\n )\n .sort_values(\"game_id\")\n .assign(\n rest=lambda df: (\n df\n .sort_values(\"date\")\n .groupby(\"team\")\n [\"date\"]\n .diff()\n .dt.days\n .sub(1)\n )\n )\n .dropna(subset=[\"rest\"])\n .astype({\"rest\": pd.Int8Dtype()})\n)\ntidy_pd\n\n\n\n\n\n \n \n \n game_id\n date\n variable\n team\n rest\n \n \n \n \n 7\n 7\n 2015-10-28\n away_team\n New Orleans Pelicans\n 0\n \n \n 11\n 11\n 2015-10-28\n away_team\n Chicago Bulls\n 0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 1315\n 1315\n 2016-06-19\n away_team\n Cleveland Cavaliers\n 2\n \n \n 2631\n 1315\n 2016-06-19\n home_team\n Golden State Warriors\n 2\n \n \n\n2602 rows × 5 columns\n\n\n\n\n\n\nNow we use .pivot so that this days-of-rest data can be added back to the original dataframe. 
We’ll also add columns for the spread between the home team’s rest and away team’s rest, and a flag for whether the home team won.\n\nPolarsPandas\n\n\n\nby_game_pl = (\n tidy_pl\n .pivot(\n values=\"rest\",\n index=[\"game_id\", \"date\"],\n columns=\"variable\"\n )\n .rename({\"away_team\": \"away_rest\", \"home_team\": \"home_rest\"})\n)\njoined_pl = (\n by_game_pl\n .join(games_pl.collect(), on=[\"game_id\", \"date\"])\n .with_columns([\n pl.col(\"home_points\").alias(\"home_win\") > pl.col(\"away_points\"),\n pl.col(\"home_rest\").alias(\"rest_spread\") - pl.col(\"away_rest\"),\n ])\n)\njoined_pl\n\n\n\nshape: (1_303, 10)game_iddateaway_resthome_restaway_teamaway_pointshome_teamhome_pointshome_winrest_spreadu32datei8i8stri64stri64booli852015-10-280null\"Chicago Bulls\"115\"Brooklyn Nets\"100falsenull62015-10-28null0\"Utah Jazz\"87\"Detroit Piston…92truenull112015-10-280null\"Cleveland Cava…106\"Memphis Grizzl…76falsenull…………………………13142016-06-1622\"Golden State W…101\"Cleveland Cava…115true013152016-06-1922\"Cleveland Cava…93\"Golden State W…89false0\n\n\n\n\n\nby_game_pd = (\n tidy_pd\n .pivot(\n values=\"rest\",\n index=[\"game_id\", \"date\"],\n columns=\"variable\"\n )\n .rename(\n columns={\"away_team\": \"away_rest\", \"home_team\": \"home_rest\"}\n )\n)\njoined_pd = by_game_pd.join(games_pd).assign(\n home_win=lambda df: df[\"home_points\"] > df[\"away_points\"],\n rest_spread=lambda df: df[\"home_rest\"] - df[\"away_rest\"],\n)\njoined_pd\n\n\n\n\n\n \n \n \n \n away_rest\n home_rest\n away_team\n away_points\n home_team\n home_points\n home_win\n rest_spread\n \n \n game_id\n date\n \n \n \n \n \n \n \n \n \n \n \n \n 7\n 2015-10-28\n 0\n \n New Orleans Pelicans\n 94\n Portland Trail Blazers\n 112\n True\n \n \n \n 11\n 2015-10-28\n 0\n \n Chicago Bulls\n 115\n Brooklyn Nets\n 100\n False\n \n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 1314\n 2016-06-16\n 2\n 2\n Golden State Warriors\n 101\n Cleveland Cavaliers\n 115\n True\n 0\n \n \n 1315\n 2016-06-19\n 2\n 2\n Cleveland Cavaliers\n 93\n Golden State Warriors\n 89\n False\n 0\n \n \n\n1303 rows × 8 columns\n\n\n\n\n\n\nHere’s a lightly edited quote from Modern Pandas:\n\nOne somewhat subtle point: an “observation” depends on the question being asked. 
So really, we have two tidy datasets, tidy for answering team-level questions, and joined for answering game-level questions.\n\nLet’s use the team-level dataframe to see each team’s average days of rest, both at home and away:\n\nimport seaborn as sns\nsns.set_theme(font_scale=0.6)\nsns.catplot(\n tidy_pl,\n x=\"variable\",\n y=\"rest\",\n col=\"team\",\n col_wrap=5,\n kind=\"bar\",\n height=1.5,\n)\n\n\n\n\n\n\n\nPlotting the distribution of rest_spread:\n\nPolarsPandas\n\n\n\nimport numpy as np\ndelta_pl = joined_pl[\"rest_spread\"]\nax = (\n delta_pl\n .value_counts()\n .drop_nulls()\n .to_pandas()\n .set_index(\"rest_spread\")\n [\"count\"]\n .reindex(np.arange(delta_pl.min(), delta_pl.max() + 1), fill_value=0)\n .sort_index()\n .plot(kind=\"bar\", color=\"k\", width=0.9, rot=0, figsize=(9, 6))\n)\nax.set(xlabel=\"Difference in Rest (Home - Away)\", ylabel=\"Games\")\n\n[Text(0.5, 0, 'Difference in Rest (Home - Away)'), Text(0, 0.5, 'Games')]\n\n\n\n\n\n\n\n\ndelta_pd = joined_pd[\"rest_spread\"]\nax = (\n delta_pd\n .value_counts()\n .reindex(np.arange(delta_pd.min(), delta_pd.max() + 1), fill_value=0)\n .sort_index()\n .plot(kind=\"bar\", color=\"k\", width=0.9, rot=0, figsize=(9, 6))\n)\nax.set(xlabel=\"Difference in Rest (Home - Away)\", ylabel=\"Games\")\n\n[Text(0.5, 0, 'Difference in Rest (Home - Away)'), Text(0, 0.5, 'Games')]\n\n\n\n\n\n\n\n\nPlotting the win percent by rest_spread:\n\nPolarsPandas\n\n\n\nimport matplotlib.pyplot as plt\nfig, ax = plt.subplots(figsize=(9, 6))\nsns.barplot(\n x=\"rest_spread\",\n y=\"home_win\",\n data=joined_pl.filter(pl.col(\"rest_spread\").is_between(-3, 3, closed=\"both\")),\n color=\"#4c72b0\",\n ax=ax,\n)\n\n\n\n\n\n\n\n\n\n\nfig, ax = plt.subplots(figsize=(9, 6))\nsns.barplot(\n x=\"rest_spread\",\n y=\"home_win\",\n data=joined_pd.query('-3 <= rest_spread <= 3'),\n color=\"#4c72b0\",\n ax=ax,\n)\n\n"
+ "text": "4.4 Tidy NBA data\nSuppose we want to calculate the days of rest each team had before each game. In the current structure this is difficult because we need to track both the home_team and away_team columns. We’ll use .melt so that there’s a single team column. This makes it easier to add a rest column with the per-team rest days between games.\n\nPolarsPandas\n\n\n\ntidy_pl = (\n games_pl\n .melt(\n id_vars=[\"game_id\", \"date\"],\n value_vars=[\"away_team\", \"home_team\"],\n value_name=\"team\",\n )\n .sort(\"game_id\")\n .with_columns((\n pl.col(\"date\")\n .alias(\"rest\")\n .diff().over(\"team\")\n .dt.total_days() - 1).cast(pl.Int8))\n .drop_nulls(\"rest\")\n .collect()\n)\ntidy_pl\n\n\n\nshape: (2_602, 5)game_iddatevariableteamrestu32datestrstri852015-10-28\"away_team\"\"Chicago Bulls\"062015-10-28\"home_team\"\"Detroit Piston…0112015-10-28\"away_team\"\"Cleveland Cava…0……………13152016-06-19\"away_team\"\"Cleveland Cava…213152016-06-19\"home_team\"\"Golden State W…2\n\n\n\n\n\ntidy_pd = (\n games_pd.reset_index()\n .melt(\n id_vars=[\"game_id\", \"date\"],\n value_vars=[\"away_team\", \"home_team\"],\n value_name=\"team\",\n )\n .sort_values(\"game_id\")\n .assign(\n rest=lambda df: (\n df\n .sort_values(\"date\")\n .groupby(\"team\")\n [\"date\"]\n .diff()\n .dt.days\n .sub(1)\n )\n )\n .dropna(subset=[\"rest\"])\n .astype({\"rest\": pd.Int8Dtype()})\n)\ntidy_pd\n\n\n\n\n\n \n \n \n game_id\n date\n variable\n team\n rest\n \n \n \n \n 7\n 7\n 2015-10-28\n away_team\n New Orleans Pelicans\n 0\n \n \n 11\n 11\n 2015-10-28\n away_team\n Chicago Bulls\n 0\n \n \n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 1315\n 1315\n 2016-06-19\n away_team\n Cleveland Cavaliers\n 2\n \n \n 2631\n 1315\n 2016-06-19\n home_team\n Golden State Warriors\n 2\n \n \n\n2602 rows × 5 columns\n\n\n\n\n\n\nNow we use .pivot so that this days-of-rest data can be added back to the original dataframe. 
We’ll also add columns for the spread between the home team’s rest and away team’s rest, and a flag for whether the home team won.\n\nPolarsPandas\n\n\n\nby_game_pl = (\n tidy_pl\n .pivot(\n values=\"rest\",\n index=[\"game_id\", \"date\"],\n columns=\"variable\"\n )\n .rename({\"away_team\": \"away_rest\", \"home_team\": \"home_rest\"})\n)\njoined_pl = (\n by_game_pl\n .join(games_pl.collect(), on=[\"game_id\", \"date\"])\n .with_columns([\n pl.col(\"home_points\").alias(\"home_win\") > pl.col(\"away_points\"),\n pl.col(\"home_rest\").alias(\"rest_spread\") - pl.col(\"away_rest\"),\n ])\n)\njoined_pl\n\n\n\nshape: (1_303, 10)game_iddateaway_resthome_restaway_teamaway_pointshome_teamhome_pointshome_winrest_spreadu32datei8i8stri64stri64booli852015-10-280null\"Chicago Bulls\"115\"Brooklyn Nets\"100falsenull62015-10-28null0\"Utah Jazz\"87\"Detroit Piston…92truenull112015-10-280null\"Cleveland Cava…106\"Memphis Grizzl…76falsenull…………………………13142016-06-1622\"Golden State W…101\"Cleveland Cava…115true013152016-06-1922\"Cleveland Cava…93\"Golden State W…89false0\n\n\n\n\n\nby_game_pd = (\n tidy_pd\n .pivot(\n values=\"rest\",\n index=[\"game_id\", \"date\"],\n columns=\"variable\"\n )\n .rename(\n columns={\"away_team\": \"away_rest\", \"home_team\": \"home_rest\"}\n )\n)\njoined_pd = by_game_pd.join(games_pd).assign(\n home_win=lambda df: df[\"home_points\"] > df[\"away_points\"],\n rest_spread=lambda df: df[\"home_rest\"] - df[\"away_rest\"],\n)\njoined_pd\n\n\n\n\n\n \n \n \n \n away_rest\n home_rest\n away_team\n away_points\n home_team\n home_points\n home_win\n rest_spread\n \n \n game_id\n date\n \n \n \n \n \n \n \n \n \n \n \n \n 7\n 2015-10-28\n 0\n \n New Orleans Pelicans\n 94\n Portland Trail Blazers\n 112\n True\n \n \n \n 11\n 2015-10-28\n 0\n \n Chicago Bulls\n 115\n Brooklyn Nets\n 100\n False\n \n \n \n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n ...\n \n \n 1314\n 2016-06-16\n 2\n 2\n Golden State Warriors\n 101\n Cleveland Cavaliers\n 115\n True\n 0\n \n \n 1315\n 2016-06-19\n 2\n 2\n Cleveland Cavaliers\n 93\n Golden State Warriors\n 89\n False\n 0\n \n \n\n1303 rows × 8 columns\n\n\n\n\n\n\nHere’s a lightly edited quote from Modern Pandas:\n\nOne somewhat subtle point: an “observation” depends on the question being asked. 
So really, we have two tidy datasets, tidy for answering team-level questions, and joined for answering game-level questions.\n\nLet’s use the team-level dataframe to see each team’s average days of rest, both at home and away:\n\nimport seaborn as sns\nsns.set_theme(font_scale=0.6)\nsns.catplot(\n tidy_pl,\n x=\"variable\",\n y=\"rest\",\n col=\"team\",\n col_wrap=5,\n kind=\"bar\",\n height=1.5,\n)\n\n\n\n\n\n\n\nPlotting the distribution of rest_spread:\n\nPolarsPandas\n\n\n\nimport numpy as np\ndelta_pl = joined_pl[\"rest_spread\"]\nax = (\n delta_pl\n .value_counts()\n .drop_nulls()\n .to_pandas()\n .set_index(\"rest_spread\")\n [\"count\"]\n .reindex(np.arange(delta_pl.min(), delta_pl.max() + 1), fill_value=0)\n .sort_index()\n .plot(kind=\"bar\", color=\"k\", width=0.9, rot=0, figsize=(9, 6))\n)\nax.set(xlabel=\"Difference in Rest (Home - Away)\", ylabel=\"Games\")\n\n[Text(0.5, 0, 'Difference in Rest (Home - Away)'), Text(0, 0.5, 'Games')]\n\n\n\n\n\n\n\n\ndelta_pd = joined_pd[\"rest_spread\"]\nax = (\n delta_pd\n .value_counts()\n .reindex(np.arange(delta_pd.min(), delta_pd.max() + 1), fill_value=0)\n .sort_index()\n .plot(kind=\"bar\", color=\"k\", width=0.9, rot=0, figsize=(9, 6))\n)\nax.set(xlabel=\"Difference in Rest (Home - Away)\", ylabel=\"Games\")\n\n[Text(0.5, 0, 'Difference in Rest (Home - Away)'), Text(0, 0.5, 'Games')]\n\n\n\n\n\n\n\n\nPlotting the win percent by rest_spread:\n\nPolarsPandas\n\n\n\nimport matplotlib.pyplot as plt\nfig, ax = plt.subplots(figsize=(9, 6))\nsns.barplot(\n x=\"rest_spread\",\n y=\"home_win\",\n data=joined_pl.filter(pl.col(\"rest_spread\").is_between(-3, 3, closed=\"both\")),\n color=\"#4c72b0\",\n ax=ax,\n)\n\n\n\n\n\n\n\n\n\n\nfig, ax = plt.subplots(figsize=(9, 6))\nsns.barplot(\n x=\"rest_spread\",\n y=\"home_win\",\n data=joined_pd.query('-3 <= rest_spread <= 3'),\n color=\"#4c72b0\",\n ax=ax,\n)\n\n"
},
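The melt-then-pivot round trip above is the heart of the reshaping; a toy two-game frame may make the long and wide shapes easier to see (column names mirror the NBA example, the data is invented):

```python
import polars as pl

games = pl.DataFrame({
    "game_id": [0, 1],
    "away_team": ["Hawks", "Bulls"],
    "home_team": ["Bulls", "Celtics"],
})

# Long ("tidy") form: one row per (game, team), so per-team logic becomes a window/group op.
tidy = games.melt(id_vars="game_id", value_vars=["away_team", "home_team"], value_name="team")
print(tidy)

# Back to wide form: one row per game, one column per original variable.
print(tidy.pivot(values="team", index="game_id", columns="variable"))
```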
{
"objectID": "tidy.html#stack-unstack-vs-melt-pivot",
@@ -207,7 +207,7 @@
"href": "tidy.html#some-visualisations",
"title": "4 Reshaping and Tidy Data",
"section": "4.7 Some visualisations",
- "text": "4.7 Some visualisations\n\ng = sns.FacetGrid(wins_pl, hue=\"team\", aspect=0.8, palette=[\"k\"], height=5)\ng.map(\n sns.pointplot,\n \"is_home\",\n \"win_pct\",\n order=[\"away_team\", \"home_team\"]).set(ylim=(0, 1))\n\n\n\n\n\n\n\n\nsns.catplot(\n wins_pl,\n x=\"is_home\",\n y=\"win_pct\",\n col=\"team\",\n col_wrap=5,\n hue=\"team\",\n kind=\"point\",\n height=1.5,\n)\n\n\n\n\n\n\n\nNow we calculate the win percent by team, regardless of whether they’re home or away:\n\nPolarsPandas\n\n\n\nwin_percent_pl = (\n wins_pl.group_by(\"team\", maintain_order=True).agg(\n pl.col(\"n_wins\").sum().alias(\"win_pct\") / pl.col(\"n_games\").sum()\n )\n)\nwin_percent_pl\n\n\n\nshape: (30, 2)teamwin_pctstrf64\"Atlanta Hawks\"0.571429\"Boston Celtics…0.563218\"Brooklyn Nets\"0.256098……\"Utah Jazz\"0.487805\"Washington Wiz…0.493827\n\n\n\n\n\nwin_percent_pd = (\n wins_pd\n .groupby(level=\"team\", as_index=True)\n .apply(lambda x: x[\"n_wins\"].sum() / x[\"n_games\"].sum())\n)\nwin_percent_pd\n\nteam\nAtlanta Hawks 0.571429\nBoston Celtics 0.563218\n ... \nUtah Jazz 0.487805\nWashington Wizards 0.493827\nLength: 30, dtype: float64\n\n\n\n\n\n\n(\n win_percent_pl\n .sort(\"win_pct\")\n .to_pandas()\n .set_index(\"team\")\n .plot.barh(figsize=(6, 12), width=0.85, color=\"k\")\n)\nplt.xlabel(\"Win Percent\")\n\nText(0.5, 0, 'Win Percent')\n\n\n\n\n\nHere’s a plot of team home court advantage against team overall win percentage:\n\nPolarsPandas\n\n\n\nwins_to_plot_pl = (\n wins_pl.pivot(index=\"team\", columns=\"is_home\", values=\"win_pct\")\n .with_columns(\n [\n pl.col(\"home_team\").alias(\"Home Win % - Away %\") - pl.col(\"away_team\"),\n (pl.col(\"home_team\").alias(\"Overall %\") + pl.col(\"away_team\")) / 2,\n ]\n )\n)\nsns.regplot(data=wins_to_plot_pl, x='Overall %', y='Home Win % - Away %')\n\n\n\n\n\n\n\n\n\n\nwins_to_plot_pd = (\n wins_pd\n [\"win_pct\"]\n .unstack()\n .assign(**{'Home Win % - Away %': lambda x: x[\"home_team\"] - x[\"away_team\"],\n 'Overall %': lambda x: (x[\"home_team\"] + x[\"away_team\"]) / 2})\n)\nsns.regplot(data=wins_to_plot_pd, x='Overall %', y='Home Win % - Away %')\n\n\n\n\n\n\n\n\n\n\nLet’s add the win percent back to the dataframe and run a regression:\n\nPolarsPandas\n\n\n\nreg_df_pl = (\n joined_pl.join(win_percent_pl, left_on=\"home_team\", right_on=\"team\")\n .rename({\"win_pct\": \"home_strength\"})\n .join(win_percent_pl, left_on=\"away_team\", right_on=\"team\")\n .rename({\"win_pct\": \"away_strength\"})\n .with_columns(\n [\n pl.col(\"home_points\").alias(\"point_diff\") - pl.col(\"away_points\"),\n pl.col(\"home_rest\").alias(\"rest_diff\") - pl.col(\"away_rest\"),\n pl.col(\"home_win\").cast(pl.UInt8), # for statsmodels\n ]\n )\n)\nreg_df_pl.head()\n\n\n\nshape: (5, 14)game_iddateaway_resthome_restaway_teamaway_pointshome_teamhome_pointshome_winrest_spreadhome_strengthaway_strengthpoint_diffrest_diffu32datei8i8stri64stri64u8i8f64f64i64i852015-10-280null\"Chicago Bulls\"115\"Brooklyn Nets\"1000null0.2560980.506173-15null62015-10-28null0\"Utah Jazz\"87\"Detroit Piston…921null0.5058820.4878055null112015-10-280null\"Cleveland Cava…106\"Memphis Grizzl…760null0.4883720.715686-30null142015-10-280null\"New Orleans Pe…94\"Portland Trail…1121null0.5268820.3703718null172015-10-2900\"Memphis Grizzl…112\"Indiana Pacers…103000.5454550.488372-90\n\n\n\n\n\nreg_df_pd = (\n joined_pd.assign(\n away_strength=joined_pd['away_team'].map(win_percent_pd),\n home_strength=joined_pd['home_team'].map(win_percent_pd),\n point_diff=joined_pd['home_points'] - 
joined_pd['away_points'],\n rest_diff=joined_pd['home_rest'] - joined_pd['away_rest'])\n)\nreg_df_pd.head()\n\n\n\n\n\n \n \n \n \n away_rest\n home_rest\n away_team\n away_points\n home_team\n home_points\n home_win\n rest_spread\n away_strength\n home_strength\n point_diff\n rest_diff\n \n \n game_id\n date\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 7\n 2015-10-28\n 0\n \n New Orleans Pelicans\n 94\n Portland Trail Blazers\n 112\n True\n \n 0.370370\n 0.526882\n 18\n \n \n \n 11\n 2015-10-28\n 0\n \n Chicago Bulls\n 115\n Brooklyn Nets\n 100\n False\n \n 0.506173\n 0.256098\n -15\n \n \n \n 15\n 2015-10-28\n \n 0\n Utah Jazz\n 87\n Detroit Pistons\n 92\n True\n \n 0.487805\n 0.505882\n 5\n \n \n \n 16\n 2015-10-28\n 0\n \n Cleveland Cavaliers\n 106\n Memphis Grizzlies\n 76\n False\n \n 0.715686\n 0.488372\n -30\n \n \n \n 17\n 2015-10-29\n 1\n 0\n Atlanta Hawks\n 112\n New York Knicks\n 101\n False\n -1\n 0.571429\n 0.382716\n -11\n -1\n \n \n\n\n\n\n\n\n\n\nimport statsmodels.formula.api as sm\n\nmod = sm.logit(\n \"home_win ~ home_strength + away_strength + home_rest + away_rest\",\n reg_df_pl.to_pandas(),\n)\nres = mod.fit()\nres.summary()\n\nOptimization terminated successfully.\n Current function value: 0.554797\n Iterations 6\n\n\n\n\nLogit Regression Results\n\n Dep. Variable: home_win No. Observations: 1299 \n\n\n Model: Logit Df Residuals: 1294 \n\n\n Method: MLE Df Model: 4 \n\n\n Date: Sun, 24 Mar 2024 Pseudo R-squ.: 0.1777 \n\n\n Time: 17:15:15 Log-Likelihood: -720.68 \n\n\n converged: True LL-Null: -876.38 \n\n\n Covariance Type: nonrobust LLR p-value: 3.748e-66\n\n\n\n\n coef std err z P>|z| [0.025 0.975] \n\n\n Intercept -0.0019 0.304 -0.006 0.995 -0.597 0.593\n\n\n home_strength 5.7161 0.466 12.272 0.000 4.803 6.629\n\n\n away_strength -4.9133 0.456 -10.786 0.000 -5.806 -4.020\n\n\n home_rest 0.1045 0.076 1.381 0.167 -0.044 0.253\n\n\n away_rest -0.0347 0.066 -0.526 0.599 -0.164 0.095\n\n\n\n\nYou can play around with the regressions yourself but we’ll end them here."
+ "text": "4.7 Some visualisations\n\ng = sns.FacetGrid(wins_pl, hue=\"team\", aspect=0.8, palette=[\"k\"], height=5)\ng.map(\n sns.pointplot,\n \"is_home\",\n \"win_pct\",\n order=[\"away_team\", \"home_team\"]).set(ylim=(0, 1))\n\n\n\n\n\n\n\n\nsns.catplot(\n wins_pl,\n x=\"is_home\",\n y=\"win_pct\",\n col=\"team\",\n col_wrap=5,\n hue=\"team\",\n kind=\"point\",\n height=1.5,\n)\n\n\n\n\n\n\n\nNow we calculate the win percent by team, regardless of whether they’re home or away:\n\nPolarsPandas\n\n\n\nwin_percent_pl = (\n wins_pl.group_by(\"team\", maintain_order=True).agg(\n pl.col(\"n_wins\").sum().alias(\"win_pct\") / pl.col(\"n_games\").sum()\n )\n)\nwin_percent_pl\n\n\n\nshape: (30, 2)teamwin_pctstrf64\"Atlanta Hawks\"0.571429\"Boston Celtics…0.563218\"Brooklyn Nets\"0.256098……\"Utah Jazz\"0.487805\"Washington Wiz…0.493827\n\n\n\n\n\nwin_percent_pd = (\n wins_pd\n .groupby(level=\"team\", as_index=True)\n .apply(lambda x: x[\"n_wins\"].sum() / x[\"n_games\"].sum())\n)\nwin_percent_pd\n\nteam\nAtlanta Hawks 0.571429\nBoston Celtics 0.563218\n ... \nUtah Jazz 0.487805\nWashington Wizards 0.493827\nLength: 30, dtype: float64\n\n\n\n\n\n\n(\n win_percent_pl\n .sort(\"win_pct\")\n .to_pandas()\n .set_index(\"team\")\n .plot.barh(figsize=(6, 12), width=0.85, color=\"k\")\n)\nplt.xlabel(\"Win Percent\")\n\nText(0.5, 0, 'Win Percent')\n\n\n\n\n\nHere’s a plot of team home court advantage against team overall win percentage:\n\nPolarsPandas\n\n\n\nwins_to_plot_pl = (\n wins_pl.pivot(index=\"team\", columns=\"is_home\", values=\"win_pct\")\n .with_columns(\n [\n pl.col(\"home_team\").alias(\"Home Win % - Away %\") - pl.col(\"away_team\"),\n (pl.col(\"home_team\").alias(\"Overall %\") + pl.col(\"away_team\")) / 2,\n ]\n )\n)\nsns.regplot(data=wins_to_plot_pl, x='Overall %', y='Home Win % - Away %')\n\n\n\n\n\n\n\n\n\n\nwins_to_plot_pd = (\n wins_pd\n [\"win_pct\"]\n .unstack()\n .assign(**{'Home Win % - Away %': lambda x: x[\"home_team\"] - x[\"away_team\"],\n 'Overall %': lambda x: (x[\"home_team\"] + x[\"away_team\"]) / 2})\n)\nsns.regplot(data=wins_to_plot_pd, x='Overall %', y='Home Win % - Away %')\n\n\n\n\n\n\n\n\n\n\nLet’s add the win percent back to the dataframe and run a regression:\n\nPolarsPandas\n\n\n\nreg_df_pl = (\n joined_pl.join(win_percent_pl, left_on=\"home_team\", right_on=\"team\")\n .rename({\"win_pct\": \"home_strength\"})\n .join(win_percent_pl, left_on=\"away_team\", right_on=\"team\")\n .rename({\"win_pct\": \"away_strength\"})\n .with_columns(\n [\n pl.col(\"home_points\").alias(\"point_diff\") - pl.col(\"away_points\"),\n pl.col(\"home_rest\").alias(\"rest_diff\") - pl.col(\"away_rest\"),\n pl.col(\"home_win\").cast(pl.UInt8), # for statsmodels\n ]\n )\n)\nreg_df_pl.head()\n\n\n\nshape: (5, 14)game_iddateaway_resthome_restaway_teamaway_pointshome_teamhome_pointshome_winrest_spreadhome_strengthaway_strengthpoint_diffrest_diffu32datei8i8stri64stri64u8i8f64f64i64i852015-10-280null\"Chicago Bulls\"115\"Brooklyn Nets\"1000null0.2560980.506173-15null62015-10-28null0\"Utah Jazz\"87\"Detroit Piston…921null0.5058820.4878055null112015-10-280null\"Cleveland Cava…106\"Memphis Grizzl…760null0.4883720.715686-30null142015-10-280null\"New Orleans Pe…94\"Portland Trail…1121null0.5268820.3703718null172015-10-2900\"Memphis Grizzl…112\"Indiana Pacers…103000.5454550.488372-90\n\n\n\n\n\nreg_df_pd = (\n joined_pd.assign(\n away_strength=joined_pd['away_team'].map(win_percent_pd),\n home_strength=joined_pd['home_team'].map(win_percent_pd),\n point_diff=joined_pd['home_points'] - 
joined_pd['away_points'],\n rest_diff=joined_pd['home_rest'] - joined_pd['away_rest'])\n)\nreg_df_pd.head()\n\n\n\n\n\n \n \n \n \n away_rest\n home_rest\n away_team\n away_points\n home_team\n home_points\n home_win\n rest_spread\n away_strength\n home_strength\n point_diff\n rest_diff\n \n \n game_id\n date\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 7\n 2015-10-28\n 0\n \n New Orleans Pelicans\n 94\n Portland Trail Blazers\n 112\n True\n \n 0.370370\n 0.526882\n 18\n \n \n \n 11\n 2015-10-28\n 0\n \n Chicago Bulls\n 115\n Brooklyn Nets\n 100\n False\n \n 0.506173\n 0.256098\n -15\n \n \n \n 15\n 2015-10-28\n \n 0\n Utah Jazz\n 87\n Detroit Pistons\n 92\n True\n \n 0.487805\n 0.505882\n 5\n \n \n \n 16\n 2015-10-28\n 0\n \n Cleveland Cavaliers\n 106\n Memphis Grizzlies\n 76\n False\n \n 0.715686\n 0.488372\n -30\n \n \n \n 17\n 2015-10-29\n 1\n 0\n Atlanta Hawks\n 112\n New York Knicks\n 101\n False\n -1\n 0.571429\n 0.382716\n -11\n -1\n \n \n\n\n\n\n\n\n\n\nimport statsmodels.formula.api as sm\n\nmod = sm.logit(\n \"home_win ~ home_strength + away_strength + home_rest + away_rest\",\n reg_df_pl.to_pandas(),\n)\nres = mod.fit()\nres.summary()\n\nOptimization terminated successfully.\n Current function value: 0.554797\n Iterations 6\n\n\n\n\nLogit Regression Results\n\n Dep. Variable: home_win No. Observations: 1299 \n\n\n Model: Logit Df Residuals: 1294 \n\n\n Method: MLE Df Model: 4 \n\n\n Date: Sun, 24 Mar 2024 Pseudo R-squ.: 0.1777 \n\n\n Time: 21:59:11 Log-Likelihood: -720.68 \n\n\n converged: True LL-Null: -876.38 \n\n\n Covariance Type: nonrobust LLR p-value: 3.748e-66\n\n\n\n\n coef std err z P>|z| [0.025 0.975] \n\n\n Intercept -0.0019 0.304 -0.006 0.995 -0.597 0.593\n\n\n home_strength 5.7161 0.466 12.272 0.000 4.803 6.629\n\n\n away_strength -4.9133 0.456 -10.786 0.000 -5.806 -4.020\n\n\n home_rest 0.1045 0.076 1.381 0.167 -0.044 0.253\n\n\n away_rest -0.0347 0.066 -0.526 0.599 -0.164 0.095\n\n\n\n\nYou can play around with the regressions yourself but we’ll end them here."
},
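If you do keep playing with the regression above, one quick follow-up is to read the coefficients as odds ratios; a brief sketch assuming `res` from the fit above is still in scope:

```python
import numpy as np

# exp(coef) is the multiplicative change in the odds of a home win per
# one-unit change in each regressor (e.g. per extra day of home rest).
print(np.exp(res.params))
```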
{
"objectID": "tidy.html#summary",
@@ -242,7 +242,7 @@
"href": "timeseries.html#rolling-expanding-ew",
"title": "5 Timeseries",
"section": "5.4 Rolling / Expanding / EW",
- "text": "5.4 Rolling / Expanding / EW\nPolars supports all three of these but they’re not quite as powerful as in Pandas, since they don’t have as many different methods. The expanding support is more limited again, though there are workarounds for this (see below):\n\nPolarsPandas\n\n\n\nclose = pl.col(\"close\")\nohlcv_pl.select(\n [\n pl.col(\"time\"),\n close.alias(\"Raw\"),\n close.rolling_mean(28).alias(\"28D MA\"),\n close.alias(\"Expanding Average\").cum_sum() / (close.cum_count() + 1),\n close.ewm_mean(alpha=0.03).alias(\"EWMA($\\\\alpha=.03$)\"),\n ]\n).to_pandas().set_index(\"time\").plot()\n\nplt.ylabel(\"Close ($)\")\n\n/tmp/ipykernel_14552/3570598896.py:8: DeprecationWarning:\n\nThe default value for `ignore_nulls` for `ewm` methods will change from True to False in the next breaking release. Explicitly set `ignore_nulls=True` to keep the existing behavior and silence this warning.\n\n\n\nText(0, 0.5, 'Close ($)')\n\n\n\n\n\n\n\n\nohlcv_pd[\"close\"].plot(label=\"Raw\")\nohlcv_pd[\"close\"].rolling(28).mean().plot(label=\"28D MA\")\nohlcv_pd[\"close\"].expanding().mean().plot(label=\"Expanding Average\")\nohlcv_pd[\"close\"].ewm(alpha=0.03).mean().plot(label=\"EWMA($\\\\alpha=.03$)\")\n\nplt.legend(bbox_to_anchor=(0.63, 0.27))\nplt.ylabel(\"Close ($)\")\n\nText(0, 0.5, 'Close ($)')\n\n\n\n\n\n\n\n\nPolars doesn’t have an expanding_mean yet so we make do by combining cumsum and cumcount.\n\n5.4.1 Combining rolling aggregations\n\nPolarsPandas\n\n\n\nmean_std_pl = ohlcv_pl.select(\n [\n \"time\",\n pl.col(\"close\").rolling_mean(30, center=True).alias(\"mean\"),\n pl.col(\"close\").rolling_std(30, center=True).alias(\"std\"),\n ]\n)\nax = mean_std_pl.to_pandas().set_index(\"time\")[\"mean\"].plot()\nax.fill_between(\n mean_std_pl[\"time\"].to_numpy(),\n mean_std_pl[\"mean\"] - mean_std_pl[\"std\"],\n mean_std_pl[\"mean\"] + mean_std_pl[\"std\"],\n alpha=0.25,\n)\nplt.tight_layout()\nplt.ylabel(\"Close ($)\")\n\nText(26.83333333333334, 0.5, 'Close ($)')\n\n\n\n\n\n\n\n\nroll_pd = ohlcv_pd[\"close\"].rolling(30, center=True)\nmean_std_pd = roll_pd.agg([\"mean\", \"std\"])\nax = mean_std_pd[\"mean\"].plot()\nax.fill_between(\n mean_std_pd.index,\n mean_std_pd[\"mean\"] - mean_std_pd[\"std\"],\n mean_std_pd[\"mean\"] + mean_std_pd[\"std\"],\n alpha=0.25,\n)\nplt.tight_layout()\nplt.ylabel(\"Close ($)\")\n\nText(26.83333333333334, 0.5, 'Close ($)')"
+ "text": "5.4 Rolling / Expanding / EW\nPolars supports all three of these but they’re not quite as powerful as in Pandas, since they don’t have as many different methods. The expanding support is more limited again, though there are workarounds for this (see below):\n\nPolarsPandas\n\n\n\nclose = pl.col(\"close\")\nohlcv_pl.select(\n [\n pl.col(\"time\"),\n close.alias(\"Raw\"),\n close.rolling_mean(28).alias(\"28D MA\"),\n close.alias(\"Expanding Average\").cum_sum() / (close.cum_count() + 1),\n close.ewm_mean(alpha=0.03).alias(\"EWMA($\\\\alpha=.03$)\"),\n ]\n).to_pandas().set_index(\"time\").plot()\n\nplt.ylabel(\"Close ($)\")\n\n/tmp/ipykernel_28816/3570598896.py:8: DeprecationWarning:\n\nThe default value for `ignore_nulls` for `ewm` methods will change from True to False in the next breaking release. Explicitly set `ignore_nulls=True` to keep the existing behavior and silence this warning.\n\n\n\nText(0, 0.5, 'Close ($)')\n\n\n\n\n\n\n\n\nohlcv_pd[\"close\"].plot(label=\"Raw\")\nohlcv_pd[\"close\"].rolling(28).mean().plot(label=\"28D MA\")\nohlcv_pd[\"close\"].expanding().mean().plot(label=\"Expanding Average\")\nohlcv_pd[\"close\"].ewm(alpha=0.03).mean().plot(label=\"EWMA($\\\\alpha=.03$)\")\n\nplt.legend(bbox_to_anchor=(0.63, 0.27))\nplt.ylabel(\"Close ($)\")\n\nText(0, 0.5, 'Close ($)')\n\n\n\n\n\n\n\n\nPolars doesn’t have an expanding_mean yet so we make do by combining cumsum and cumcount.\n\n5.4.1 Combining rolling aggregations\n\nPolarsPandas\n\n\n\nmean_std_pl = ohlcv_pl.select(\n [\n \"time\",\n pl.col(\"close\").rolling_mean(30, center=True).alias(\"mean\"),\n pl.col(\"close\").rolling_std(30, center=True).alias(\"std\"),\n ]\n)\nax = mean_std_pl.to_pandas().set_index(\"time\")[\"mean\"].plot()\nax.fill_between(\n mean_std_pl[\"time\"].to_numpy(),\n mean_std_pl[\"mean\"] - mean_std_pl[\"std\"],\n mean_std_pl[\"mean\"] + mean_std_pl[\"std\"],\n alpha=0.25,\n)\nplt.tight_layout()\nplt.ylabel(\"Close ($)\")\n\nText(26.83333333333334, 0.5, 'Close ($)')\n\n\n\n\n\n\n\n\nroll_pd = ohlcv_pd[\"close\"].rolling(30, center=True)\nmean_std_pd = roll_pd.agg([\"mean\", \"std\"])\nax = mean_std_pd[\"mean\"].plot()\nax.fill_between(\n mean_std_pd.index,\n mean_std_pd[\"mean\"] - mean_std_pd[\"std\"],\n mean_std_pd[\"mean\"] + mean_std_pd[\"std\"],\n alpha=0.25,\n)\nplt.tight_layout()\nplt.ylabel(\"Close ($)\")\n\nText(26.83333333333334, 0.5, 'Close ($)')"
},
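The cum_sum-over-cum_count trick used above as a stand-in for the missing expanding_mean can be sanity-checked on a toy series. One caveat to hedge: whether you need the +1 in the denominator depends on your Polars version, since cum_count has been 0-based in older releases and 1-based in newer ones:

```python
import polars as pl

toy = pl.DataFrame({"close": [1.0, 2.0, 3.0, 4.0]})

print(
    toy.select(
        (pl.col("close").cum_sum() / pl.col("close").cum_count()).alias("expanding_mean")
    )
)
# With a 1-based cum_count this prints 1.0, 1.5, 2.0, 2.5; if your version
# counts from 0, divide by (cum_count() + 1) as the example above does.
```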
{
"objectID": "timeseries.html#grab-bag",
@@ -277,7 +277,7 @@
"href": "scaling.html#executing-multiple-queries-in-parallel",
"title": "6 Scaling",
"section": "6.3 Executing multiple queries in parallel",
- "text": "6.3 Executing multiple queries in parallel\nOften we want to generate multiple insights from the same data, and we need them in separate dataframes. In this case, using collect_all is more efficient than calling .collect multiple times, because Polars can avoid repeating common operations like reading the data.\nLet’s compute the average donation size, the total donated by employer and the average donation by occupation:\n\nPolarsDask\n\n\n\n%%time\nindiv_pl = pl.scan_parquet(fec_dir / \"indiv*.pq\")\navg_transaction_lazy_pl = indiv_pl.select(pl.col(\"TRANSACTION_AMT\").mean())\ntotal_by_employer_lazy_pl = (\n indiv_pl.drop_nulls(\"EMPLOYER\")\n .group_by(\"EMPLOYER\")\n .agg([pl.col(\"TRANSACTION_AMT\").sum()])\n .sort(\"TRANSACTION_AMT\", descending=True)\n .head(10)\n)\navg_by_occupation_lazy_pl = (\n indiv_pl.group_by(\"OCCUPATION\")\n .agg([pl.col(\"TRANSACTION_AMT\").mean()])\n .sort(\"TRANSACTION_AMT\", descending=True)\n .head(10)\n)\n\navg_transaction_pl, total_by_employer_pl, avg_by_occupation_pl = pl.collect_all(\n [avg_transaction_lazy_pl, total_by_employer_lazy_pl, avg_by_occupation_lazy_pl],\n streaming=True,\n comm_subplan_elim=False, # cannot use CSE with streaming\n)\n\nCPU times: user 14.7 s, sys: 2.99 s, total: 17.7 s\nWall time: 5.78 s\n\n\n\n\n\n%%time\nindiv_dd = (\n dd.read_parquet(fec_dir / \"indiv*.pq\", engine=\"pyarrow\")\n # pandas and dask want datetimes but this is a date col\n .assign(\n TRANSACTION_DT=lambda df: dd.to_datetime(df[\"TRANSACTION_DT\"], errors=\"coerce\")\n )\n)\navg_transaction_lazy_dd = indiv_dd[\"TRANSACTION_AMT\"].mean()\ntotal_by_employer_lazy_dd = (\n indiv_dd.groupby(\"EMPLOYER\", observed=True)[\"TRANSACTION_AMT\"].sum().nlargest(10)\n)\navg_by_occupation_lazy_dd = (\n indiv_dd.groupby(\"OCCUPATION\", observed=True)[\"TRANSACTION_AMT\"].mean().nlargest(10)\n)\navg_transaction_dd, total_by_employer_dd, avg_by_occupation_dd = compute(\n avg_transaction_lazy_dd, total_by_employer_lazy_dd, avg_by_occupation_lazy_dd\n)\n\nCPU times: user 25.8 s, sys: 3.92 s, total: 29.8 s\nWall time: 17.2 s\n\n\n\n\n\nThe Polars code above tends to be ~3.5x faster than Dask on my machine, which if anything is a smaller speedup than I expected.\nWe should also profile memory usage, since it could be the case that Polars is just running faster because it’s reading in bigger chunks. According to the fil profiler, the Dask example’s memory usage peaks at 1450 MiB, while Polars uses ~10% more than that.\nBefore I forget, here are the results of our computations:\n\n6.3.1 avg_transaction\n\nPolarsDask\n\n\n\navg_transaction_pl\n\n\n\nshape: (1, 1)TRANSACTION_AMTf64563.97184\n\n\n\n\n\navg_transaction_dd\n\n563.9718398183915\n\n\n\n\n\n\n\n6.3.2 total_by_employer\n\nPolarsDask\n\n\n\ntotal_by_employer_pl\n\n\n\nshape: (10, 2)EMPLOYERTRANSACTION_AMTcati32\"RETIRED\"1023306104\"SELF-EMPLOYED\"834757599\"N/A\"688186834……\"FAHR, LLC\"166679844\"CANDIDATE\"75187243\n\n\n\n\n\ntotal_by_employer_dd\n\nEMPLOYER\nRETIRED 1023306104\nSELF-EMPLOYED 834757599\n ... \nFAHR, LLC 166679844\nCANDIDATE 75187243\nName: TRANSACTION_AMT, Length: 10, dtype: int32\n\n\n\n\n\n\n\n6.3.3 avg_by_occupation\n\nPolarsDask\n\n\n\navg_by_occupation_pl\n\n\n\nshape: (10, 2)OCCUPATIONTRANSACTION_AMTcatf64\"CHAIRMAN CEO &…1.0233e6\"PAULSON AND CO…1e6\"CO-FOUNDING DI…875000.0……\"CHIEF EXECUTIV…500000.0\"MOORE CAPITAL …500000.0\n\n\n\n\n\navg_by_occupation_dd\n\nOCCUPATION\nCHAIRMAN CEO & FOUNDER 1.023333e+06\nPAULSON AND CO., INC. 1.000000e+06\n ... 
\nOWNER, FOUNDER AND CEO 5.000000e+05\nCHIEF EXECUTIVE OFFICER/PRODUCER 5.000000e+05\nName: TRANSACTION_AMT, Length: 10, dtype: float64"
+ "text": "6.3 Executing multiple queries in parallel\nOften we want to generate multiple insights from the same data, and we need them in separate dataframes. In this case, using collect_all is more efficient than calling .collect multiple times, because Polars can avoid repeating common operations like reading the data.\nLet’s compute the average donation size, the total donated by employer and the average donation by occupation:\n\nPolarsDask\n\n\n\n%%time\nindiv_pl = pl.scan_parquet(fec_dir / \"indiv*.pq\")\navg_transaction_lazy_pl = indiv_pl.select(pl.col(\"TRANSACTION_AMT\").mean())\ntotal_by_employer_lazy_pl = (\n indiv_pl.drop_nulls(\"EMPLOYER\")\n .group_by(\"EMPLOYER\")\n .agg([pl.col(\"TRANSACTION_AMT\").sum()])\n .sort(\"TRANSACTION_AMT\", descending=True)\n .head(10)\n)\navg_by_occupation_lazy_pl = (\n indiv_pl.group_by(\"OCCUPATION\")\n .agg([pl.col(\"TRANSACTION_AMT\").mean()])\n .sort(\"TRANSACTION_AMT\", descending=True)\n .head(10)\n)\n\navg_transaction_pl, total_by_employer_pl, avg_by_occupation_pl = pl.collect_all(\n [avg_transaction_lazy_pl, total_by_employer_lazy_pl, avg_by_occupation_lazy_pl],\n streaming=True,\n comm_subplan_elim=False, # cannot use CSE with streaming\n)\n\nCPU times: user 13.4 s, sys: 2.66 s, total: 16 s\nWall time: 5.09 s\n\n\n\n\n\n%%time\nindiv_dd = (\n dd.read_parquet(fec_dir / \"indiv*.pq\", engine=\"pyarrow\")\n # pandas and dask want datetimes but this is a date col\n .assign(\n TRANSACTION_DT=lambda df: dd.to_datetime(df[\"TRANSACTION_DT\"], errors=\"coerce\")\n )\n)\navg_transaction_lazy_dd = indiv_dd[\"TRANSACTION_AMT\"].mean()\ntotal_by_employer_lazy_dd = (\n indiv_dd.groupby(\"EMPLOYER\", observed=True)[\"TRANSACTION_AMT\"].sum().nlargest(10)\n)\navg_by_occupation_lazy_dd = (\n indiv_dd.groupby(\"OCCUPATION\", observed=True)[\"TRANSACTION_AMT\"].mean().nlargest(10)\n)\navg_transaction_dd, total_by_employer_dd, avg_by_occupation_dd = compute(\n avg_transaction_lazy_dd, total_by_employer_lazy_dd, avg_by_occupation_lazy_dd\n)\n\nCPU times: user 25.5 s, sys: 3.99 s, total: 29.5 s\nWall time: 17 s\n\n\n\n\n\nThe Polars code above tends to be ~3.5x faster than Dask on my machine, which if anything is a smaller speedup than I expected.\nWe should also profile memory usage, since it could be the case that Polars is just running faster because it’s reading in bigger chunks. According to the fil profiler, the Dask example’s memory usage peaks at 1450 MiB, while Polars uses ~10% more than that.\nBefore I forget, here are the results of our computations:\n\n6.3.1 avg_transaction\n\nPolarsDask\n\n\n\navg_transaction_pl\n\n\n\nshape: (1, 1)TRANSACTION_AMTf64563.97184\n\n\n\n\n\navg_transaction_dd\n\n563.9718398183915\n\n\n\n\n\n\n\n6.3.2 total_by_employer\n\nPolarsDask\n\n\n\ntotal_by_employer_pl\n\n\n\nshape: (10, 2)EMPLOYERTRANSACTION_AMTcati32\"RETIRED\"1023306104\"SELF-EMPLOYED\"834757599\"N/A\"688186834……\"FAHR, LLC\"166679844\"CANDIDATE\"75187243\n\n\n\n\n\ntotal_by_employer_dd\n\nEMPLOYER\nRETIRED 1023306104\nSELF-EMPLOYED 834757599\n ... \nFAHR, LLC 166679844\nCANDIDATE 75187243\nName: TRANSACTION_AMT, Length: 10, dtype: int32\n\n\n\n\n\n\n\n6.3.3 avg_by_occupation\n\nPolarsDask\n\n\n\navg_by_occupation_pl\n\n\n\nshape: (10, 2)OCCUPATIONTRANSACTION_AMTcatf64\"CHAIRMAN CEO &…1.0233e6\"PAULSON AND CO…1e6\"CO-FOUNDING DI…875000.0……\"MOORE CAPITAL …500000.0\"PERRY HOMES\"500000.0\n\n\n\n\n\navg_by_occupation_dd\n\nOCCUPATION\nCHAIRMAN CEO & FOUNDER 1.023333e+06\nPAULSON AND CO., INC. 1.000000e+06\n ... 
\nOWNER, FOUNDER AND CEO 5.000000e+05\nCHIEF EXECUTIVE OFFICER/PRODUCER 5.000000e+05\nName: TRANSACTION_AMT, Length: 10, dtype: float64"
},
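A stripped-down, self-contained sketch of the collect_all pattern above, using a tiny in-memory frame instead of the FEC parquet files so it runs anywhere:

```python
import polars as pl

lf = pl.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]}).lazy()

totals = lf.group_by("g").agg(pl.col("x").sum())
overall_mean = lf.select(pl.col("x").mean())

# One call materialises both queries; Polars can run them in parallel and,
# when not streaming, share common subplans such as the scan/source.
totals_df, mean_df = pl.collect_all([totals, overall_mean])
print(totals_df)
print(mean_df)
```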
{
"objectID": "scaling.html#filtering",
diff --git a/tidy.html b/tidy.html
index 5ae50a7..f9486c4 100644
--- a/tidy.html
+++ b/tidy.html
@@ -362,15 +362,9 @@
/tmp/ipykernel_14427/3011805186.py:7: DeprecationWarning:
-
-`with_row_count` is deprecated. Use `with_row_index` instead. Note that the default column name has changed from 'row_nr' to 'index'.
-