Refactor the data.__init__.py module #525

Open · jrycw wants to merge 4 commits into main

Conversation

jrycw (Collaborator) commented Nov 28, 2024

Hello team,

This PR addresses the Pandas dependency in reading the bundled datasets by introducing a unified Dataset API. The proposed approach lets users retrieve datasets in the dataframe format of their choice. For example:

  • To get a Pandas dataframe for the sza dataset, use Dataset.sza.to_pandas().
  • To get a Polars dataframe for the same dataset, use Dataset.sza.to_polars().
  • To get a PyArrow table for the same dataset, use Dataset.sza.to_pyarrow() (this requires implementing _convert_to_pyarrow() and to_pyarrow()).

This way, users can use autocomplete to select both the dataset and the desired dataframe type.

To facilitate the transition (a short sketch of the resulting module layout follows this list):

  • Each dataset name begins with an uppercase letter (e.g., Sza).
  • A lowercase variable (e.g., sza) is provided as a Pandas dataframe, created using to_pandas().
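
For illustration, here is a minimal, self-contained sketch of that dual naming, assuming a lazily loading converter class along the lines proposed in this PR (the class body, file name, and column dtypes are placeholders, not the actual implementation):

from pathlib import Path

import pandas as pd
import polars as pl


class _Dataset:  # illustrative stand-in for the converter class in this PR
    def __init__(self, filepath: Path, dtype: dict):
        self.filepath = filepath
        self.dtype = dtype

    def to_pandas(self) -> pd.DataFrame:
        # the CSV is only read when one of the to_*() methods is called
        return pd.read_csv(self.filepath, dtype=self.dtype)

    def to_polars(self) -> pl.DataFrame:
        return pl.read_csv(self.filepath)


_DATA_DIR = Path(__file__).parent  # hypothetical location of the bundled CSVs

# Uppercase name: the dataframe-agnostic dataset object
Sza = _Dataset(_DATA_DIR / "sza.csv", dtype={"tst": str, "sza": float})

# Lowercase name: eager Pandas dataframe, kept for backwards compatibility
sza = Sza.to_pandas()  # remove pandas

__all__ = ["Sza", "sza"]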

If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:

  • Remove the warning message.
  • Remove the lowercase variables representing Pandas dataframes, and rename the dataset classes to lowercase.
  • Update __all__.
  • Update documentation and tests to reflect the new API.
  • Address code sections marked with # remove pandas for further cleanup.

I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.

codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.77%. Comparing base (925fa41) to head (4486114).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #525      +/-   ##
==========================================
- Coverage   89.79%   89.77%   -0.03%     
==========================================
  Files          45       45              
  Lines        5321     5406      +85     
==========================================
+ Hits         4778     4853      +75     
- Misses        543      553      +10     


jrycw (Collaborator, Author) commented Nov 28, 2024

The Dataset API can be used as follows:

from great_tables import GT
from great_tables.data import Dataset

exibble = Dataset.exibble

# Pandas users
df_pd = exibble.to_pandas()
GT(df_pd).show()


# Polars users
df_pl = exibble.to_polars()
GT(df_pl).show()


# PyArrow users
...

machow (Collaborator) left a comment

Hey, thanks for working on this! This is a huge step towards not depending on a specific DataFrame library for our examples!

One thing I wonder is whether it might help cut out boilerplate to use a style with a bit more composition, rather than class inheritance.

def to_pyarrow(cls): ...


class Countrypops(_DataFrameConverter):
machow (Collaborator) commented on the lines above:

Currently, this approach works by having the class _DataFrameConverter...

  • specify a filepath class attribute, etc.
  • get subclassed to define those attributes

One risk of this approach is that it leans a bit hard on inheritance (see composition over inheritance). WDYT of an approach where things like Countrypops are instances of _DataFrameConverter, or something similar?

In this case you could add init arguments to _DataFrameConverter, like...

class _DataFrameConverter:
    def __init__(self, filepath, dtype):
        self.filepath = filepath
        self.dtype = dtype

Countrypops = _DataFrameConverter(
    DATA_MOD / "01-countrypops.csv",
    dtype={
        "country_name": str,
        "country_code_2": str,
        "country_code_3": str,
        "year": int,
        "population": int,
    },
)
"""A docstring"""

jrycw (Collaborator, Author) commented Dec 8, 2024

@machow, thank you for the excellent suggestion! Since we're moving towards favoring composition, renaming _DataFrameConverter to _Dataset seems more appropriate, and passing the docstrings as arguments aligns well with this approach. In addition, I've taken the liberty of implementing the PyArrow conversion and adding a __repr__() method, though I'm uncertain about its utility; feedback on this would be appreciated.

For the internal datasets, I recommend retaining the current parsing approach with Pandas. It appears to be more robust, and I've noticed that handling certain cases with Polars is a bit tricky, such as x-airquality.csv (missing values) and x_locales.csv (complex types, which Pandas handles well with the object dtype). Since these datasets are used internally, we should have room to explore better solutions for these edge cases.
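
A rough sketch of what those two additions could look like on the composed class (the dtype-to-Arrow-type mapping and the repr format are illustrative assumptions, not the PR's actual code; to_pandas() and to_polars() are omitted):

from pathlib import Path

import pyarrow as pa
import pyarrow.csv as pa_csv


class _Dataset:
    def __init__(self, filepath: Path, dtype: dict, docstring: str = ""):
        self.filepath = filepath
        self.dtype = dtype
        self.__doc__ = docstring  # docstring passed as an argument, per the composition approach

    def to_pyarrow(self) -> pa.Table:
        # translate the declared Python dtypes into PyArrow column types (illustrative mapping)
        pa_types = {str: pa.string(), int: pa.int64(), float: pa.float64()}
        column_types = {col: pa_types[t] for col, t in self.dtype.items() if t in pa_types}
        return pa_csv.read_csv(
            self.filepath,
            convert_options=pa_csv.ConvertOptions(column_types=column_types),
        )

    def __repr__(self) -> str:
        # e.g. _Dataset(01-countrypops); mainly useful in the REPL and in docs
        return f"{type(self).__name__}({self.filepath.stem})"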

"mass_excess_uncert": "float64",
}

class _DataFrameConverter:
machow (Collaborator) commented:

In case it's useful, I ended up creating a tiny DataFrame-like implementation in reactable-py, just so I could feed example data into demos:

https://github.com/machow/reactable-py/blob/main/reactable/simpleframe.py

I think a nice advantage of your approach here, though, is that it doesn't read the CSV until you use one of the to_*() methods!

jrycw (Collaborator, Author) commented Dec 8, 2024

The code looks great! It seems you'll need to keep the data as an internal variable within SimpleFrame to support the various get and set operations. However, for our dataset use case, we should be able to store just the filepath and dtype rather than holding the entire dataset in memory.

jrycw (Collaborator, Author) commented Dec 8, 2024

The CI failure seems unrelated to our codebase.

Run browser-actions/setup-edge@v1
Setup Edge stable
Attempting to download Edge stable...
Error: Artifact not found of Edge stable for platform linux amd64 (https://github.com/posit-dev/great-tables/actions/runs/12219134318/job/34085425230#step:6:5)

machow (Collaborator) commented Dec 10, 2024

@rich-iannone do you mind taking a look at this? In particular, would love to get your take on...

  • What variable names should we have in great_tables.data? E.g. currently there are...
    • data.Islands for the _Dataset instance, which has to_polars() and to_pandas() methods
    • data.islands (note the lowercase i) for the pd.DataFrame
  • What should the default be now, and in the long term? (see below)

Here are some possible approaches we could take to data.

Backwards compatible: keep pandas the default, provide new variables for _Dataset objects

  • e.g. data.Islands and data.islands both importable
  • I'd prefer we eventually just have 1 thing. We could always make a _Dataset instance run some to_*() method by default, when passed to GT(), if e.g. polars is installed, etc...

Agnostic: shift to _Dataset objects (e.g. Islands)

  • deprecate pandas-by-import:
    • when users import variables like data.islands
    • give a warning that it's deprecated
    • suggest the user import data.Islands and run the .to_pandas() method

Agnostic: shift to _Dataset and eventually rename (e.g. Islands -> islands)

  • deprecate pandas-by-import, with an alternative import path (see the sketch after this list):
    • when users import variables like data.islands
    • give a warning that it's deprecated
    • suggest the user import data.v0.islands
    • provide to_pandas(), to_polars(), to_...() on data.islands -- as a subclass of pandas or something
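
As a small sketch of that deprecation shim (the names, warning text, and lowercase-to-uppercase mapping are illustrative assumptions), a module-level __getattr__ (PEP 562) in great_tables/data/__init__.py could intercept the old imports:

import warnings

# illustrative subset of the lowercase names slated for deprecation
_DEPRECATED_PANDAS_NAMES = {"islands", "exibble", "sza"}


def __getattr__(name: str):
    if name in _DEPRECATED_PANDAS_NAMES:
        warnings.warn(
            f"`data.{name}` is deprecated; import `data.{name.capitalize()}` and call "
            f"`.to_pandas()`, or use `data.v0.{name}` for the old behavior.",
            DeprecationWarning,
            stacklevel=2,
        )
        # e.g. islands -> Islands, assumed to be a _Dataset instance defined in this module
        return globals()[name.capitalize()].to_pandas()
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

This hook also covers from-style imports (from great_tables.data import islands), since they fall back to the module-level __getattr__.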

rich-iannone (Member) commented:

@machow out of the three options you provided, I prefer the third one: "shift to _Dataset and eventually rename (e.g. Islands -> islands)".

machow (Collaborator) commented Dec 10, 2024

Noice -- another way to do the third thing could be something like this?:

  • data.islands: current behavior, deprecate -- tell them to do v0 import.
  • data.v0.islands: old behavior
  • data.v1.islands: new behavior
  • X months from now, data.islands: new behavior (and we are fully dataframe agnostic!)

edit: eh, lemme think about this compared to the original third option 😭

jrycw (Collaborator, Author) commented Dec 14, 2024

Including the previously discussed issue #91 as a reference.
