Refactor the data.__init__.py module #525
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #525 +/- ##
==========================================
- Coverage 89.79% 89.77% -0.03%
==========================================
Files 45 45
Lines 5321 5406 +85
==========================================
+ Hits 4778 4853 +75
- Misses 543 553 +10
from great_tables import GT
from great_tables.data import Dataset
exibble = Dataset.exibble
# Pandas users
df_pd = exibble.to_pandas()
GT(df_pd).show()
# Polars users
df_pl = exibble.to_polars()
GT(df_pl).show()
# PyArrow users
...
Hey, thanks for working on this! This is a huge step towards not depending on a specific DataFrame library for our examples!
One thing I wonder is whether it might help cut out boilerplate to use a style with a bit more composition, rather than class inheritance.
great_tables/data/__init__.py
Outdated
    def to_pyarrow(cls): ...


class Countrypops(_DataFrameConverter):
Currently, this approach works by having the class _DataFrameConverter...
- specifying a filepath class attribute, etc..
- subclassing it to define those attributes
One risk of this approach is that it leans a bit hard on inheritance (see composition over inheritance). WDYT of an approach where things like Countrypops are instances of _DataFrameConverter, or something similar?
In this case you could add init arguments to _DataFrameConverter, like...
class _DataFrameConverter:
    def __init__(self, filepath, dtype):
        self.filepath = filepath
        self.dtype = dtype


Countrypops = _DataFrameConverter(
    DATA_MOD / "01-countrypops.csv",
    dtype={
        "country_name": str,
        "country_code_2": str,
        "country_code_3": str,
        "year": int,
        "population": int,
    },
)
"""A docstring"""
@machow, thank you for the excellent suggestion! Since we're moving towards favoring composition, renaming _DataFrameConverter to _Dataset seems more appropriate, and passing the docstrings as arguments aligns well with this approach. In addition, I've taken the liberty of implementing the PyArrow conversion and added a __repr__() method, though I'm uncertain about its utility; feedback on this would be appreciated.
For the internal datasets, I recommend retaining the current parsing approach with Pandas. It appears to be more robust, and I've noticed that handling certain cases with Polars, such as x-airquality.csv (missing values) and x_locales.csv (complex types, which Pandas handles well with the object dtype), is a bit tricky. Since these datasets are used internally, we should have room to explore better solutions for these edge cases.
great_tables/data/__init__.py
Outdated
    "mass_excess_uncert": "float64",
}


class _DataFrameConverter:
In case it's useful, I ended up creating a tiny DataFrame-like implementation in reactable-py, just so I could feed example data for demos:
https://github.com/machow/reactable-py/blob/main/reactable/simpleframe.py
I think a nice advantage of your approach here though is that it doesn't read the csv until you use one of the to_*() methods!
The code looks great! It seems you'll need to keep the data as an internal variable within SimpleFrame to enable many get and set actions. However, for our dataset use case, we should be able to store just the filepath and dtype rather than holding the entire dataset in memory.
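The "store just the filepath and dtype" point can be sketched with the standard library alone. The `LazyCsv` class and its `to_rows()` method below are hypothetical stand-ins for _Dataset and its to_*() methods, and the demo file is made up; the only thing the sketch tries to show is that nothing is read at construction time.

```python
import csv
import tempfile
from pathlib import Path


class LazyCsv:
    """Hypothetical stand-in for _Dataset: remembers where the data lives, not the data."""

    def __init__(self, filepath, dtype):
        self.filepath = Path(filepath)
        self.dtype = dtype  # column name -> callable used to convert each cell

    def to_rows(self):
        # The file is opened only here, on each call; nothing is cached on the instance.
        with self.filepath.open(newline="") as f:
            return [
                {col: self.dtype[col](val) for col, val in row.items()}
                for row in csv.DictReader(f)
            ]


# Construction succeeds even before the file exists, because nothing is read yet.
tmp_dir = Path(tempfile.mkdtemp())
lazy = LazyCsv(tmp_dir / "demo.csv", dtype={"year": int, "population": int})

# Made-up demo data, written after the object was created.
(tmp_dir / "demo.csv").write_text("year,population\n1960,100\n1961,110\n")
rows = lazy.to_rows()
```

The trade-off is that every call re-reads the file, which seems acceptable for small example datasets but is a design choice rather than a given.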
The CI failure seems unrelated to our codebase.
@rich-iannone do you mind taking a look at this? In particular, would love to get your take on...
Here are some possible approaches we could take to data. Backwards compatible: keep pandas the default, provide new variables for
@machow out of the three options you provided, I prefer the third one: "shift to
Noice -- another way to do the third thing could be something like this?:
edit: eh, lemme think about this compared to the original third option 😭
Include the previously discussed issue #91 as a reference.
Hello team,
This PR aims to address the Pandas dependency in reading datasets by introducing a unified Dataset API. The proposed approach allows users to retrieve datasets in a user-specified dataframe format. For example:
- Pandas dataframe for the sza dataset: use Dataset.sza.to_pandas().
- Polars dataframe for the same dataset: use Dataset.sza.to_polars().
- PyArrow dataframe for the same dataset: use Dataset.sza.to_pyarrow() (implementation of _convert_to_pyarrow() and to_pyarrow() is needed to support this).
This way, users can use autocomplete to select both the dataset and the desired dataframe type.
To facilitate the transition:
- [...] (Sza).
- [...] (sza) is provided as a Pandas dataframe, created using to_pandas().
If we decide to completely remove Pandas as a dependency in the future, the following tasks will be required:
- [...] Pandas dataframes, and rename the dataset classes to lowercase.
- [...] __all__.
- [...] # remove pandas for further cleanup.
I’m confident there are other excellent approaches to tackle this issue, so please feel free to modify or reject this PR as needed.
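For readers skimming the thread, the Dataset.sza.to_pandas() access pattern described above could be wired up roughly as follows. This is a sketch under assumptions, not the PR's implementation: the file paths are placeholders rather than the packaged file names, dtype handling is omitted, and each dataframe library is imported inside its method so that defining the namespace requires none of them.

```python
from pathlib import Path


class _Dataset:
    """Hypothetical sketch; the PR's real class may differ."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def to_pandas(self):
        import pandas as pd  # imported only when this method is called

        return pd.read_csv(self.filepath)

    def to_polars(self):
        import polars as pl  # imported only when this method is called

        return pl.read_csv(self.filepath)

    def to_pyarrow(self):
        from pyarrow import csv as pa_csv  # imported only when this method is called

        return pa_csv.read_csv(self.filepath)


class Dataset:
    """Namespace so editors can autocomplete Dataset.<name>.to_<library>()."""

    sza = _Dataset("sza.csv")  # placeholder path, not the packaged file name
    exibble = _Dataset("exibble.csv")  # placeholder path


# Defining and inspecting the namespace needs no dataframe library at all.
available = [name for name in vars(Dataset) if not name.startswith("_")]
```

A backwards-compatible module-level variable would then be something like `sza = Dataset.sza.to_pandas()`, which is the one place Pandas would still be imported eagerly.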