Refactor the data.__init__.py module #525

Open · jrycw wants to merge 4 commits into main

Conversation

jrycw (Collaborator) commented Nov 28, 2024

Hello team,

This PR addresses the Pandas dependency in reading the bundled datasets by introducing a unified Dataset API. The proposed approach lets users retrieve datasets in the dataframe format of their choice. For example:

  • To get a Pandas dataframe for the sza dataset, use Dataset.sza.to_pandas().
  • To get a Polars dataframe for the same dataset, use Dataset.sza.to_polars().
  • To get a PyArrow table for the same dataset, use Dataset.sza.to_pyarrow() (this requires implementing _convert_to_pyarrow() and to_pyarrow()).

This way, users can use autocomplete to select both the dataset and the desired dataframe type.

To facilitate the transition (a short sketch of the resulting module layout follows this list):

  • Each dataset name begins with an uppercase letter (e.g., Sza).
  • A lowercase variable (e.g., sza) is provided as a Pandas dataframe, created using to_pandas().
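
For illustration, here is a minimal, self-contained sketch of that dual naming, assuming a lazily loading converter class along the lines proposed in this PR (the class body, file name, and column dtypes are placeholders, not the actual implementation):

from pathlib import Path

import pandas as pd
import polars as pl


class _Dataset:  # illustrative stand-in for the converter class in this PR
    def __init__(self, filepath: Path, dtype: dict):
        self.filepath = filepath
        self.dtype = dtype

    def to_pandas(self) -> pd.DataFrame:
        # the CSV is only read when one of the to_*() methods is called
        return pd.read_csv(self.filepath, dtype=self.dtype)

    def to_polars(self) -> pl.DataFrame:
        return pl.read_csv(self.filepath)


_DATA_DIR = Path(__file__).parent  # hypothetical location of the bundled CSVs

# Uppercase name: the dataframe-agnostic dataset object
Sza = _Dataset(_DATA_DIR / "sza.csv", dtype={"tst": str, "sza": float})

# Lowercase name: eager Pandas dataframe, kept for backwards compatibility
sza = Sza.to_pandas()  # remove pandas

__all__ = ["Sza", "sza"]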

If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:

  • Remove the warning message.
  • Remove the lowercase variables representing Pandas dataframes, and rename the dataset classes to lowercase.
  • Update __all__.
  • Update documentation and tests to reflect the new API.
  • Address code sections marked with # remove pandas for further cleanup.

I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.

codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.77%. Comparing base (925fa41) to head (4486114).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
- Coverage   89.79%   89.77%   -0.03%     
==========================================
  Files          45       45              
  Lines        5321     5406      +85     
==========================================
+ Hits         4778     4853      +75     
- Misses        543      553      +10     


jrycw (Collaborator, Author) commented Nov 28, 2024

The Dataset API can be used as follows:

from great_tables import GT
from great_tables.data import Dataset

exibble = Dataset.exibble

# Pandas users
df_pd = exibble.to_pandas()
GT(df_pd).show()


# Polars users
df_pl = exibble.to_polars()
GT(df_pl).show()


# PyArrow users
...

machow (Collaborator) left a comment

Hey, thanks for working on this! This is a huge step towards not depending on a specific DataFrame library for our examples!

One thing I wonder is whether it might help cut out boilerplate to use a style with a bit more composition, rather than class inheritance.

def to_pyarrow(cls): ...


class Countrypops(_DataFrameConverter):
machow (Collaborator) commented on the lines above:

Currently, this approach works by having the class _DataFrameConverter...

  • specify a filepath class attribute, etc.
  • get subclassed to define those attributes

One risk of this approach is that it leans a bit hard on inheritance (see composition over inheritance). WDYT of an approach where things like Countrypops are instances of _DataFrameConverter, or something similar?

In this case you could add init arguments to _DataFrameConverter, like...

class _DataFrameConverter:
    def __init__(self, filepath, dtype):
        self.filepath = filepath
        self.dtype = dtype

Countrypops = _DataFrameConverter(
    DATA_MOD / "01-countrypops.csv",
    dtype={
        "country_name": str,
        "country_code_2": str,
        "country_code_3": str,
        "year": int,
        "population": int,
    },
)
"""A docstring"""

jrycw (Collaborator, Author) commented Dec 8, 2024

@machow, thank you for the excellent suggestion! Since we're moving towards favoring composition, renaming _DataFrameConverter to _Dataset seems more appropriate, and passing the docstrings as arguments aligns well with this approach. In addition, I've taken the liberty of implementing the PyArrow conversion and adding a __repr__() method, though I'm uncertain about its utility; feedback on this would be appreciated.

For the internal datasets, I recommend retaining the current parsing approach with Pandas. It appears to be more robust, and I've noticed that handling certain cases with Polars is a bit tricky, such as x-airquality.csv (missing values) and x_locales.csv (complex types, which Pandas handles well with the object dtype). Since these datasets are used internally, we should have room to explore better solutions for these edge cases.
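
A rough sketch of what those two additions could look like on the composed class (the dtype-to-Arrow-type mapping and the repr format are illustrative assumptions, not the PR's actual code; to_pandas() and to_polars() are omitted):

from pathlib import Path

import pyarrow as pa
import pyarrow.csv as pa_csv


class _Dataset:
    def __init__(self, filepath: Path, dtype: dict, docstring: str = ""):
        self.filepath = filepath
        self.dtype = dtype
        self.__doc__ = docstring  # docstring passed as an argument, per the composition approach

    def to_pyarrow(self) -> pa.Table:
        # translate the declared Python dtypes into PyArrow column types (illustrative mapping)
        pa_types = {str: pa.string(), int: pa.int64(), float: pa.float64()}
        column_types = {col: pa_types[t] for col, t in self.dtype.items() if t in pa_types}
        return pa_csv.read_csv(
            self.filepath,
            convert_options=pa_csv.ConvertOptions(column_types=column_types),
        )

    def __repr__(self) -> str:
        # e.g. _Dataset(01-countrypops); mainly useful in the REPL and in docs
        return f"{type(self).__name__}({self.filepath.stem})"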

"mass_excess_uncert": "float64",
}

class _DataFrameConverter:
machow (Collaborator) commented:

In case it's useful, I ended up creating a tiny DataFrame-like implementation in reactable-py, just so I could feed example data into demos:

https://github.com/machow/reactable-py/blob/main/reactable/simpleframe.py

I think a nice advantage of your approach here, though, is that it doesn't read the CSV until you use one of the to_*() methods!

jrycw (Collaborator, Author) commented Dec 8, 2024

The code looks great! It seems you'll need to keep the data as an internal variable within SimpleFrame to support the various get and set operations. However, for our dataset use case, we should be able to store just the filepath and dtype rather than holding the entire dataset in memory.

jrycw (Collaborator, Author) commented Dec 8, 2024

The CI failure seems unrelated to our codebase.

Run browser-actions/setup-edge@v1
Setup Edge stable
Attempting to download Edge stable...
Error: Artifact not found of Edge stable for platform linux amd64 (https://github.com/posit-dev/great-tables/actions/runs/12219134318/job/34085425230#step:6:5)

machow (Collaborator) commented Dec 10, 2024

@rich-iannone do you mind taking a look at this? In particular, would love to get your take on...

  • What variable names should we have in great_tables.data? E.g. currently there are...
    • data.Islands for the _Dataset instance, which has to_polars() and to_pandas() methods
    • data.islands (note the lowercase i) for the pd.DataFrame
  • What should the default be now, and in the long term? (see below)

Here are some possible approaches we could take to data.

Backwards compatible: keep pandas the default, provide new variables for _Dataset objects

  • e.g. data.Islands and data.islands both importable
  • I'd prefer we eventually just have 1 thing. We could always make a _Dataset instance run some to_*() method by default, when passed to GT(), if e.g. polars is installed, etc...

Agnostic: shift to _Dataset objects (e.g. Islands)

  • deprecate pandas-by-import:
    • when users import variables like data.islands
    • give a warning that it's deprecated
    • suggest the user import data.Islands and run the .to_pandas() method

Agnostic: shift to _Dataset and eventually rename (e.g. Islands -> islands)

  • deprecate pandas-by-import, with an alternative import path (see the sketch after this list):
    • when users import variables like data.islands
    • give a warning that it's deprecated
    • suggest the user import data.v0.islands
    • provide to_pandas(), to_polars(), to_...() on data.islands -- as a subclass of pandas or something
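
As a small sketch of that deprecation shim (the names, warning text, and lowercase-to-uppercase mapping are illustrative assumptions), a module-level __getattr__ (PEP 562) in great_tables/data/__init__.py could intercept the old imports:

import warnings

# illustrative subset of the lowercase names slated for deprecation
_DEPRECATED_PANDAS_NAMES = {"islands", "exibble", "sza"}


def __getattr__(name: str):
    if name in _DEPRECATED_PANDAS_NAMES:
        warnings.warn(
            f"`data.{name}` is deprecated; import `data.{name.capitalize()}` and call "
            f"`.to_pandas()`, or use `data.v0.{name}` for the old behavior.",
            DeprecationWarning,
            stacklevel=2,
        )
        # e.g. islands -> Islands, assumed to be a _Dataset instance defined in this module
        return globals()[name.capitalize()].to_pandas()
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

This hook also covers from-style imports (from great_tables.data import islands), since they fall back to the module-level __getattr__.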

rich-iannone (Member) commented:

@machow out of the three options you provided, I prefer the third one: "shift to _Dataset and eventually rename (e.g. Islands -> islands)".

machow (Collaborator) commented Dec 10, 2024

Noice -- another way to do the third thing could be something like this?:

  • data.islands: current behavior, deprecate -- tell them to do v0 import.
  • data.v0.islands: old behavior
  • data.v1.islands: new behavior
  • X months from now, data.islands: new behavior (and we are fully dataframe agnostic!)

edit: eh, lemme think about this compared to the original third option 😭

jrycw (Collaborator, Author) commented Dec 14, 2024

Including the previously discussed issue #91 as a reference.
