Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

E2e test alt typing #9

Closed
wants to merge 3 commits into from
Closed

E2e test alt typing #9

wants to merge 3 commits into from

Conversation

trey-stafford
Copy link
Contributor

@trey-stafford trey-stafford commented Aug 14, 2024

A follow-on from #8 , which provides an alternative approach to using pandera that is "more" (?) compatible with mypy for static type-checking.

Rather than using pandera types directly, this PR subclasses pandas.DataFrame and calls validate using the appropriate pandera schema on init. This ensures that data instantiated from that class matches the schema.

The drawback is that we can't use pandera's decorators that run validators on function inputs/outputs. My feeling is that this is an OK tradeoff because we still validate the data at instantiation, and we get the benefits of static typechecking. On the other hand, this approach could introduce subtle errors when something is typed as IceFlowData, but then a pandas operation mutates the dataframe (e.g., dropping a required column).

So...there are pros/cons for this approach and the one given in #8 . Another option would be to have mypy ignore pandera types. This could lead to mistakes in typing, but runtime validation would run. Are there other approaches worth considering? Could stop using pandera and stick with writing functions that take e.g., individual series as arguments instead of a dataframe that we expect to contain certain fields, but I was hoping to avoid that.

@trey-stafford trey-stafford mentioned this pull request Aug 14, 2024
@trey-stafford trey-stafford changed the base branch from main to e2e-test August 14, 2024 21:54
# Validate the data w/ pandera
# TODO: Does this result in pandera validating the common columns twice?
# The `super` call above would trigger the `IceFlowData`'s __init__,
# which include a call to `validate` on the common data columns.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential issue here. I read that pandera will try to cache validation, but would want to confirm if that's happening here. Maybe there's a better approach, where the schema is passed into the class at instantiation time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worst case, could use some class state to avoid double-validation.

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Validate the data w/ pandera
IceFlowDataSchema.validate(self)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation only happens at instantiation.

def transform_itrf(
data: DataFrame[commonDataColumns],
data: IceFlowData,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data here is typed with the parent class. Normally we'd expect dataset-specific subclasses to be passed in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be any subclass, or only specific ones? I.e. should we use union instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any subclass.

@@ -83,7 +82,7 @@ def test_e2e(tmp_path):

# This df contains data w/ two ITRFs: ITRF2005 and ITRF2008.
complete_df = pd.concat(all_dfs)
complete_df = DataFrame_co[atm1bData](complete_df)
complete_df = ATM1BData(complete_df)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more readable/understandable than what it was changed from, in my eyes. Maybe I'm just not used to using TypeVar.

Copy link
Contributor

@mfisher87 mfisher87 Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can give DataFrame_co[atm1bData] an alias, and I think we'd have the same readability. But I'm wondering now, why use both TypeVar and PEP695 generic syntax? I thought PEP695 was ruled out (similar comment in #8)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, an alias might be a nice way to handle this.

Tbh I'm still pretty fuzzy on PEP695 and how this is working. T

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fault, I'm being confused and confusing. Please ignore my comment on PEP695 here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to worry - I'm confused too 🤣

@trey-stafford trey-stafford marked this pull request as ready for review August 15, 2024 14:29
@mfisher87
Copy link
Contributor

So...there are pros/cons for this approach and the one given in #8 . Another option would be to have mypy ignore pandera types. This could lead to mistakes in typing, but runtime validation would run. Are there other approaches worth considering? Could stop using pandera and stick with writing functions that take e.g., individual series as arguments instead of a dataframe that we expect to contain certain fields, but I was hoping to avoid that.

Oof, none of these options are great, but I need to remind myself that they're all better than passing around unknown dataframes.

Copy link
Contributor

@mfisher87 mfisher87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to approve both approaches and not take a side. I can see value in both. But I think we should jump on a call and get a little more in-depth if you'd like a more informed opinion :) There's a lot to consider! And I also have a lot of gaps. Let me know how you want to go forward!

def transform_itrf(
data: DataFrame[commonDataColumns],
data: IceFlowData,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be any subclass, or only specific ones? I.e. should we use union instead?

@trey-stafford
Copy link
Contributor Author

Declining in favor of #8

@trey-stafford trey-stafford deleted the e2e-test-alt-typing branch August 16, 2024 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants