Rework groupby and resample core modules #848
Just a note on stubtest: since stubtest does not run as part of CI, it generated 4073 errors on the main branch. With the changes in this PR, I got the number down to 3396 errors, almost 700 fewer. With some tweaking and organization, I got the number of allow list entries down to a manageable 840. I would happily upload this allow list file and add stubtest to CI if you are willing. It will help the project stay up to date with upstream pandas and prevent regressions in already fixed code.
Thank you very much for this giant PR!
Please wait for feedback from @Dr-Irv before making further changes.
Thanks for this big contribution. It's a hard one to review, so I may not have caught everything.
Some meta-level-notes:
1. The initial version of these stubs was produced by using `stubgen` (before I was involved in the project), probably using pandas 1.1. We've been slowly removing things that are not documented. So if you use `stubgen` to generate things from the pandas source, we end up having to do those removals again.
2. Using the pandas source to generate stubs has risks, because the types within the pandas source code are meant to keep types consistent within pandas for pandas developers, and they are not necessarily appropriate for users. So we use the stubs as a way of providing type checking for the most common usage patterns of pandas. That also means we try to only include classes, methods and functions in the stubs that are documented, as well as parameters that are documented. One advantage of this is that it has helped us identify incorrect documentation in pandas and to have discussions on whether we want to accept certain types in various methods. So the rule of thumb is that the stubs are provided for what is documented, and we tend to be as narrow as we can to help users in the best way possible.
3. If we need some helper classes, we make them private with a preceding underscore. We're probably not consistent, but that's an overall goal. One thing I noticed is that there are classes that can be returned by pandas (and are documented as such), but that users should never instantiate, so we should not include `__init__()` in the stubs for those classes.
4. You included `NoDefault` in many places. That is used in the pandas source to tell users there is no default value, or that if you don't supply a value, there is an assumed default within pandas. Users should never pass `NoDefault`, so we should not include it in the stubs.
5. I think the setting of default values of the parameters should be moved to a separate PR so as to minimize the changes coming from this PR.
6. I don't think we should be using `@final` or any of the pandas decorators in the stubs. The latter are not documented. I don't see the value of including `@final` in stubs that are meant for user code (but I'm open to an alternative point of view).
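To illustrate the `NoDefault` point, here is a minimal sketch that does not use pandas itself. The `_NoDefault` enum, the `dropna` parameter, and both function names are hypothetical stand-ins chosen for illustration; they are not taken from the pandas source or from the stubs in this PR. The idea is that the runtime signature may carry a sentinel, while a user-facing stub can describe only the values a user is expected to pass.

```python
from __future__ import annotations

import enum
import inspect


class _NoDefault(enum.Enum):
    """Hypothetical stand-in for a library-internal 'no default' sentinel."""

    no_default = "NO_DEFAULT"


_no_default = _NoDefault.no_default


def runtime_style(dropna: bool | _NoDefault = _no_default) -> bool:
    """How a library might write the signature internally (assumption)."""
    if dropna is _no_default:
        return True  # the library falls back to its own assumed default
    return dropna


def stub_style(dropna: bool = ...) -> bool:
    """How a user-facing stub could describe the same parameter (assumption)."""
    ...


# The sentinel shows up in the runtime signature but not in the stub-style one.
print(inspect.signature(runtime_style))
print(inspect.signature(stub_style))
```

With a stub written this way, a type checker would accept `stub_style()` and `stub_style(False)` and would reject an attempt to pass the sentinel, which matches the intent that users never pass it.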
One other request:
7. Can we add a test for the issue reported in #810 (testing the `engine` keyword)? If you just do it for the specific example provided there, that's fine - no need to create a test for all methods that accept `engine`.
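A test along these lines might look like the sketch below. It is not the example from #810: the choice of `mean`, the `engine="cython"` value, the test name, and the sample data are all assumptions made for illustration, and it needs pandas installed to run. At runtime `assert_type` simply returns its argument, so the static check only takes effect when a type checker is run over the file with the stubs installed.

```python
import pandas as pd

try:  # assert_type is in typing from Python 3.11 onward
    from typing import assert_type
except ImportError:  # older interpreters need the typing_extensions backport
    from typing_extensions import assert_type


def test_groupby_mean_engine() -> None:
    # Hypothetical sample data, small enough to inspect by eye.
    df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
    # A type checker should accept the keyword and infer a DataFrame here,
    # assuming the stubs declare `engine` on this method.
    result = assert_type(df.groupby("key").mean(engine="cython"), pd.DataFrame)
    # Runtime check that the call really returns what the stub promises.
    assert isinstance(result, pd.DataFrame)


test_groupby_mean_engine()
```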
Thank you both for the review. I’ll get back to you soon with the answers and changes.
Needs fix everywhere and the upstream docs should be updated
I think I responded to / made the required changes for all the comments above except for this one about
I am afraid that we have to disagree here. I believe that the benefit of running stubtest on public methods that use I could've imported it as
I did a search, and the only place it is used is in the parameter I see your point about the value of having
This is pretty close. Found a few small things, and we still have some open discussions.
I didn't cross-check it with the documentation/pandas code but the changes overall look amazing!
@hamdanal let me know if you want me to do another review, or if you have more commits coming.
I addressed your previous comments. Feel free to do another review.
Everything looks fine codewise, but there is a change to `pyproject.toml` that I don't think should be there.
I reverted the `pyproject.toml` change and fixed some new warnings in the tests. I also pinned pyright since it changed some of its error codes and started failing in places unrelated to this PR.
@@ -252,7 +252,7 @@ class DataFrameGroupBy(GroupBy[DataFrame], Generic[ByT]):
     ) -> DataFrame: ...
     @overload
     def boxplot(
-        grouped,
+        grouped,  # pyright: ignore[reportSelfClsParameterName]
Why not just use `self` here (and in the other overloads)?
Thanks @hamdanal! Lots of good work here.
- `pandas.core.groupby.generic.DataFrameGroupBy.resample` #825, closes `core.groupby.generic.DataFrameGroupBy` missing keyword-only arguments `engine` and `engine_kwargs` #810, closes `DataFrame.Groupby.Rolling` is missing proper typing #737, improves the situation for Create more precise annotations for DataFrame/Series GroupBy operations (agg, apply, transform) #456
- `assert_type()` to assert the type of any return value

Firstly, my apologies for the lengthy PR.
I intended to fix `df.groupby().resample()`, but it was very hard to do so given how much was missing in that area. I then went down the rabbit hole of running stubtest and fixing all allow list entries related to groupby and its methods. Hopefully the extensive testing I added and the happy stubtest will make it easier to review.