Rework groupby and resample core modules #848

hamdanal · 2024-01-06T20:24:53Z

Closes Return type from pandas.core.groupby.generic.DataFrameGroupBy.resample #825, closes core.groupby.generic.DataFrameGroupBy missing keyword-only arguments engine and engine_kwargs #810, closes DataFrame.Groupby.Rolling is missing proper typing #737, improves the situation for Create more precise annotations for DataFrame/Series GroupBy operations (agg, apply, transform) #456
Tests added: Please use assert_type() to assert the type of any return value

Firstly, my apologies for the lengthy PR.
I intended to fix df.groupby().resample() but it was very hard to do so given how much was missing in that area. I then went down the rabbit hole of running stubtest and fixing all allow list entries related to groupby and its methods. Hopefully the extensive testing I added and the happy stubtest will make it easier to review.

hamdanal · 2024-01-06T20:27:05Z

Just a note on stubtest:

Given that stubtest does not run as part of the CI, it generated 4073 errors on the main branch. With the changes in this PR, I got the number down to 3396 errors, that's almost 700 less errors. With some tweaking and organization, I got the number of allow list entries down to a manageable 840 entries. I would happily upload this allow list file and add stubtest to CI if you are willing. It will help the project stay up-to-date with upstream pandas and will prevent regressions in already fixed code.

pandas-stubs/_typing.pyi

pandas-stubs/core/apply.pyi

pandas-stubs/core/generic.pyi

pandas-stubs/core/groupby/generic.pyi

pandas-stubs/core/groupby/grouper.pyi

twoertwein

Thank you very much for this giant PR!

Please wait for feedback from @Dr-Irv before making further changes.

Dr-Irv

Thanks for this big contribution. It's a hard one to review, so I may not have caught everything.

Some meta-level-notes:

The initial version of these stubs was produced by using stubgen (before I was involved in the project), probably using pandas 1.1. We've been slowly removing things that are not documented. So if you use stubgen to generate things from the pandas source, we end up having to do those removals again.
Using the pandas source to generate stubs has risks because the types within the pandas source code are meant to make sure that the types are consistent within pandas for pandas developers, but not necessarily are they appropriate for users. So we use the stubs as a way of providing type checking for the most common usage patterns of pandas. That also means we try to only include classes, methods and functions in the stubs that are documented, as well as parameters that are documented. One advantage of this is that it has helped us identify incorrect documentation in pandas and to have discussions on whether we want to accept certain types in various methods. So the rule of thumb is that the stubs are provided for what is documented, and we tend to be as narrow as we can to help users in the best way possible.
If we need some helper classes, we make them private with a preceding underscore. We're probably not consistent, but that's an overall goal. One thing I noticed is that there are classes that can be returned by pandas (and are documented as such), but users should never instantiate - so we should not include __init__() in the stubs for those classes.
You included NoDefault in many places -t that is used in the pandas source to tell users there is no default value, or that if you don't supply a value, there is an assumed default within pandas. So users should never pass NoDefault, so we should not include it in the stubs.
I think the setting of default values of the parameters should be moved to a separate PR so as to minimize the changes coming from this PR.
I don't think we should be using @final or any of the pandas decorators in the stubs. The latter is not documented. I don't see the value for including @final in the stubs that are meant for user code (but I'm open to an alternative point of view).

One other request:
7. Can we add a test for the issue reported in #810 (testing the engine keyword)? If you just do it for the specific example provided there, that's fine - no need to create a test for all methods that accept engine .

pandas-stubs/_typing.pyi

pandas-stubs/core/apply.pyi

pandas-stubs/core/base.pyi

pandas-stubs/core/frame.pyi

pandas-stubs/plotting/_core.pyi

pandas-stubs/util/_decorators.pyi

tests/test_groupby.py

tests/test_resampler.py

hamdanal · 2024-01-11T20:20:01Z

Thank you both for the review. I’ll get back to you soon with the answers and changes.

Needs fix everywhere and the upstream docs should be updated

hamdanal · 2024-01-13T14:25:45Z

I think I responded to / made the required changes for all the comments above except for this one about NoDefault

I don't think we should be adding NoDefault as a possible argument in the stubs, because a user really shouldn't pass that value. It's just used internally by the pandas code to say that if a value wasn't passed, we assume a specific default.

I am afraid that we have to disagree here. I believe that the benefit of running stubtest on public methods that use NoDefault is far more important than hiding the symbol from the user. This is demonstrated in a couple of places in this PR where stubtest helped us detect wrong function signature (example there is no squeeze parameter to DataFrame.groupy). Unfortunately, because the default value at runtime is an object of type NoDefault, stubtest will complain that the type of the argument does not cover the default value if this type is to be omitted and we will have to add the whole method to the allowlist.
Additionally, the symbol NoDefault is imported from an underscored module, this is enough indication to the user that it is private. It also has a descriptive name of what it does -- it is not very inviting for users to use it. I believe if you compare the pros and cons, the symbol should be used where needed.

I could've imported it as _NoDefault (or any underscored name you like) and used this "more private" name instead but I didn't because I saw it used without the underscore in other places already. If you prefer the underscored name, maybe we can change all its uses in a separate PR?

Dr-Irv · 2024-01-14T17:28:44Z

I could've imported it as _NoDefault (or any underscored name you like) and used this "more private" name instead but I didn't because I saw it used without the underscore in other places already. If you prefer the underscored name, maybe we can change all its uses in a separate PR?

I did a search, and the only place it is used is in the parameter dtype_backend in various I/O routines.

I see your point about the value of having stubtest pick up things that are incorrect in the stubs vs. the source. So let's fix in another PR. Created an issue #851 for this.

Dr-Irv

This is pretty close. Found a few small things, and we still have some open discussions.

pandas-stubs/core/groupby/generic.pyi

pandas-stubs/core/groupby/ops.pyi

pandas-stubs/core/resample.pyi

pandas-stubs/core/window/__init__.pyi

twoertwein

I didn't cross-check it with the documentation/pandas code but the changes overall look amazing!

Dr-Irv · 2024-01-17T22:24:57Z

@hamdanal let me know if you want me to do another review, or if you have more commits coming

hamdanal · 2024-01-19T19:44:37Z

@hamdanal let me know if you want me to do another review, or if you have more commits coming

I addressed your previous comments. Feel free to do another review.

Dr-Irv

Everything looks fine codewise, but there is a change to pyproject.toml that I don't think should be there.

pyproject.toml

hamdanal · 2024-02-04T19:34:26Z

I reverted the pyproject.toml change and fixed some new warnings in the tests. I also pinned pyright since it changed some of its error codes and started failing in places unrelated to this PR.

Dr-Irv · 2024-02-06T15:48:53Z

pandas-stubs/core/groupby/generic.pyi

@@ -252,7 +252,7 @@ class DataFrameGroupBy(GroupBy[DataFrame], Generic[ByT]):
    ) -> DataFrame: ...
    @overload
    def boxplot(
-        grouped,
+        grouped,  # pyright: ignore[reportSelfClsParameterName]


Why not just use self here (and in the other overloads)?

pandas-stubs/core/groupby/generic.pyi

Dr-Irv

Thanks @hamdanal ! Lots of good work here

Rework groupby and resample core modules

c5a98cf

Runtime Series is not generic

0ec4cbc

twoertwein requested a review from Dr-Irv January 6, 2024 21:36