E2e test alt typing #9

trey-stafford · 2024-08-14T19:50:29Z

A follow-on from #8 that provides an alternative approach to using pandera, one that is "more" (?) compatible with mypy for static type-checking.

Rather than using pandera types directly, this PR subclasses pandas.DataFrame and calls validate using the appropriate pandera schema on init. This ensures that data instantiated from that class matches the schema.

The drawback is that we can't use pandera's decorators that run validators on function inputs/outputs. My feeling is that this is an OK tradeoff because we still validate the data at instantiation, and we get the benefits of static type-checking. On the other hand, this approach could introduce subtle errors when something is typed as IceFlowData but a later pandas operation mutates the dataframe (e.g., dropping a required column), since nothing re-validates it.

So... there are pros/cons for this approach and the one given in #8. Another option would be to have mypy ignore pandera types. This could lead to mistakes in typing, but runtime validation would still run. Are there other approaches worth considering? We could stop using pandera and stick with writing functions that take, e.g., individual series as arguments instead of a dataframe that we expect to contain certain fields, but I was hoping to avoid that.

trey-stafford · 2024-08-14T21:56:06Z

src/iceflow/ingest/atm1b.py

+        # Validate the data w/ pandera
+        # TODO: Does this result in pandera validating the common columns twice?
+        # The `super` call above triggers `IceFlowData`'s __init__,
+        # which includes a call to `validate` on the common data columns.


Another potential issue here. I read that pandera will try to cache validation, but I'd want to confirm that's actually happening here. Maybe there's a better approach, where the schema is passed into the class at instantiation time?

Worst case, we could use some class state to avoid double validation.

trey-stafford · 2024-08-14T22:01:02Z

src/iceflow/ingest/models.py

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        # Validate the data w/ pandera
+        IceFlowDataSchema.validate(self)


Validation only happens at instantiation.

trey-stafford · 2024-08-14T22:01:44Z

src/iceflow/itrf/converter.py

 def transform_itrf(
-    data: DataFrame[commonDataColumns],
+    data: IceFlowData,


data here is typed with the parent class. Normally we'd expect dataset-specific subclasses to be passed in.

Can it be any subclass, or only specific ones? I.e., should we use a union instead?

Any subclass.
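As a small illustration of why the parent-class annotation is enough here, assuming pandas is installed: `describe` is a hypothetical stand-in for `transform_itrf`, `OtherDatasetData` is an invented second dataset, and schema validation is left out to keep the sketch short.

```python
import pandas as pd


class IceFlowData(pd.DataFrame):
    """Parent class (validation omitted in this sketch)."""


class ATM1BData(IceFlowData):
    """Dataset-specific subclass named in this PR."""


class OtherDatasetData(IceFlowData):
    """Hypothetical second dataset, added later."""


def describe(data: IceFlowData) -> str:
    # Annotating with the parent accepts every subclass, present or future.
    return f"{type(data).__name__} with {len(data)} rows"


results = [
    describe(ATM1BData({"latitude": [70.0]})),
    describe(OtherDatasetData({"latitude": [70.0, 71.0]})),
]
```

A union such as `ATM1BData | OtherDatasetData` would need editing every time a dataset is added, whereas the parent-class annotation keeps working unchanged.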

trey-stafford · 2024-08-14T22:02:34Z

tests/test_e2e.py

@@ -83,7 +82,7 @@ def test_e2e(tmp_path):

    # This df contains data w/ two ITRFs: ITRF2005 and ITRF2008.
    complete_df = pd.concat(all_dfs)
-    complete_df = DataFrame_co[atm1bData](complete_df)
+    complete_df = ATM1BData(complete_df)


This looks more readable/understandable than what it was changed from, in my eyes. Maybe I'm just not used to using TypeVar.

You can give DataFrame_co[atm1bData] an alias, and I think we'd have the same readability. But I'm wondering now, why use both TypeVar and PEP 695 generic syntax? I thought PEP 695 was ruled out (similar comment in #8).

Hmm, an alias might be a nice way to handle this.

Tbh I'm still pretty fuzzy on PEP695 and how this is working.

My fault, I'm being confused and confusing. Please ignore my comment on PEP695 here.

Not to worry - I'm confused too 🤣

mfisher87 · 2024-08-15T15:31:02Z

> So... there are pros/cons for this approach and the one given in #8. Another option would be to have mypy ignore pandera types. This could lead to mistakes in typing, but runtime validation would still run. Are there other approaches worth considering? We could stop using pandera and stick with writing functions that take, e.g., individual series as arguments instead of a dataframe that we expect to contain certain fields, but I was hoping to avoid that.

Oof, none of these options are great, but I need to remind myself that they're all better than passing around unknown dataframes.

mfisher87

I'm going to approve both approaches and not take a side. I can see value in both. But I think we should jump on a call and get a little more in-depth if you'd like a more informed opinion :) There's a lot to consider! And I also have a lot of gaps. Let me know how you want to go forward!

mfisher87 · 2024-08-15T15:43:33Z

src/iceflow/itrf/converter.py

 def transform_itrf(
-    data: DataFrame[commonDataColumns],
+    data: IceFlowData,


Can it be any subclass, or only specific ones? I.e., should we use a union instead?

trey-stafford · 2024-08-16T17:49:09Z

Declining in favor of #8

trey-stafford added 3 commits August 14, 2024 11:12

Experiment w/ using subclassing & pandera instead of just pandera

68115a9

Rename pandera classes w/ schema in the name

b95c492

Cleanup TODO

d28da77

trey-stafford mentioned this pull request Aug 14, 2024

E2E test #8

Merged

trey-stafford changed the base branch from main to e2e-test August 14, 2024 21:54

trey-stafford commented Aug 14, 2024

trey-stafford requested a review from mfisher87 August 14, 2024 22:02

trey-stafford marked this pull request as ready for review August 15, 2024 14:29

mfisher87 approved these changes Aug 15, 2024

trey-stafford closed this Aug 16, 2024

trey-stafford deleted the e2e-test-alt-typing branch August 16, 2024 20:00
