E2e test alt typing #9
Conversation
```python
# Validate the data w/ pandera
# TODO: Does this result in pandera validating the common columns twice?
# The `super` call above would trigger the `IceFlowData`'s __init__,
# which includes a call to `validate` on the common data columns.
```
Another potential issue here: I read that pandera will try to cache validation, but I'd want to confirm that's actually happening here. Maybe there's a better approach, where the schema is passed into the class at instantiation time?
Worst case, could use some class state to avoid double-validation.
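One way the skip-state idea could look, sketched with hypothetical column names and a hand-rolled check standing in for pandera's `validate` (kept dependency-free; none of these names come from the actual PR):

```python
class IceFlowData:
    """Stand-in for the validating pd.DataFrame subclass (hypothetical)."""

    common_columns = {"latitude", "longitude", "elevation"}  # assumed names

    def __init__(self, data, _skip_validation=False):
        self.data = dict(data)
        if not _skip_validation:
            self._require(self.common_columns)

    def _require(self, columns):
        missing = columns - self.data.keys()
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")


class ATM1BData(IceFlowData):
    extra_columns = {"pitch", "roll"}  # assumed dataset-specific columns

    def __init__(self, data):
        # Ask the parent to skip its check; the full-schema check below
        # already covers the common columns, so nothing is validated twice.
        super().__init__(data, _skip_validation=True)
        self._require(self.common_columns | self.extra_columns)
```

The flag could equally be a class attribute that subclasses override; the point is just that only the most-derived schema runs its check.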
```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Validate the data w/ pandera
    IceFlowDataSchema.validate(self)
```
Validation only happens at instantiation.
```diff
 def transform_itrf(
-    data: DataFrame[commonDataColumns],
+    data: IceFlowData,
```
`data` here is typed with the parent class. Normally we'd expect dataset-specific subclasses to be passed in.
Can it be any subclass, or only specific ones? I.e. should we use union instead?
Any subclass.
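A minimal sketch of the typing point being settled here: annotating with the parent class already accepts every subclass, so a `Union` would only be needed to restrict the accepted set to specific subclasses. Class and function names below are stand-ins, not the real definitions from the PR:

```python
class IceFlowData:
    """Stand-in parent class."""


class ATM1BData(IceFlowData):
    """Stand-in dataset-specific subclass."""


class OtherDatasetData(IceFlowData):
    """Hypothetical second subclass."""


def transform_itrf(data: IceFlowData) -> IceFlowData:
    # Any IceFlowData subclass type-checks here. Annotating with
    # Union[ATM1BData, OtherDatasetData] would instead whitelist
    # only those specific subclasses.
    return data


result = transform_itrf(ATM1BData())
```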
```diff
@@ -83,7 +82,7 @@ def test_e2e(tmp_path):

     # This df contains data w/ two ITRFs: ITRF2005 and ITRF2008.
     complete_df = pd.concat(all_dfs)
-    complete_df = DataFrame_co[atm1bData](complete_df)
+    complete_df = ATM1BData(complete_df)
```
This looks more readable/understandable than what it was changed from, in my eyes. Maybe I'm just not used to using `TypeVar`.
You can give `DataFrame_co[atm1bData]` an alias, and I think we'd have the same readability. But I'm wondering now, why use both `TypeVar` and PEP 695 generic syntax? I thought PEP 695 was ruled out (similar comment in #8).
Hmm, an alias might be a nice way to handle this. Tbh I'm still pretty fuzzy on PEP 695 and how this is working.
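The alias suggestion could look something like the following. Note that `DataFrame_co` and `atm1bData` here are stand-in definitions written for this sketch, since the real ones live elsewhere in the PR:

```python
from typing import Generic, TypeVar

T_co = TypeVar("T_co", covariant=True)


class atm1bData:
    """Stand-in for the pandera schema/model class."""


class DataFrame_co(Generic[T_co]):
    """Stand-in for the covariant DataFrame wrapper."""

    def __init__(self, data):
        self.data = data


# The alias gives the parameterized type a single readable name,
# so call sites don't repeat the bracketed form:
Atm1bFrame = DataFrame_co[atm1bData]

frame = Atm1bFrame({"elevation": [1200.0]})
```

At runtime `Atm1bFrame` is just a callable generic alias, so constructing through it behaves identically to `DataFrame_co(...)`.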
My fault, I'm being confused and confusing. Please ignore my comment on PEP695 here.
Not to worry - I'm confused too 🤣
Oof, none of these options are great, but I need to remind myself that they're all better than passing around unknown dataframes.
I'm going to approve both approaches and not take a side. I can see value in both. But I think we should jump on a call and get a little more in-depth if you'd like a more informed opinion :) There's a lot to consider! And I also have a lot of gaps. Let me know how you want to go forward!
```diff
 def transform_itrf(
-    data: DataFrame[commonDataColumns],
+    data: IceFlowData,
```
Can it be any subclass, or only specific ones? I.e. should we use union instead?
Declining in favor of #8
A follow-on from #8, which provides an alternative approach to using `pandera` that is "more" (?) compatible with `mypy` for static type-checking.

Rather than using `pandera` types directly, this PR subclasses `pandas.DataFrame` and calls `validate` with the appropriate `pandera` schema on init. This ensures that data instantiated from that class matches the schema.

The drawback is that we can't use `pandera`'s decorators that run validators on function inputs/outputs. My feeling is that this is an OK tradeoff because we still validate the data at instantiation, and we get the benefits of static type-checking. On the other hand, this approach could introduce subtle errors when something is typed as `IceFlowData` but a pandas operation then mutates the dataframe (e.g., dropping a required column).

So... there are pros/cons for this approach and the one given in #8. Another option would be to have `mypy` ignore `pandera` types. This could lead to mistakes in typing, but runtime validation would still run. Are there other approaches worth considering? We could stop using `pandera` and stick with writing functions that take, e.g., individual series as arguments instead of a dataframe that we expect to contain certain fields, but I was hoping to avoid that.
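The mutation pitfall described above can be made concrete with a small dependency-free sketch, where a dict stands in for the `pandas.DataFrame` subclass and a hand-rolled check stands in for pandera (column names are assumed, not from the PR):

```python
REQUIRED_COLUMNS = {"latitude", "longitude", "elevation"}  # assumed schema


class IceFlowData(dict):
    """Column -> values mapping; validates on construction only."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        missing = REQUIRED_COLUMNS - self.keys()
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")


def strip_elevation(data: IceFlowData) -> IceFlowData:
    # mypy is satisfied: the annotation never changes. But the schema is
    # now violated, and no check fires because nothing is re-instantiated.
    data.pop("elevation")
    return data


valid = IceFlowData(latitude=[70.1], longitude=[-45.0], elevation=[1200.0])
broken = strip_elevation(valid)  # slips through silently
```

Re-wrapping the result, e.g. `IceFlowData(broken)`, would raise, which is the only point at which the schema guarantee is re-established under this approach.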