Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework groupby and resample core modules #848

Merged
merged 23 commits into from
Feb 6, 2024
Merged

Conversation

hamdanal
Copy link
Contributor

@hamdanal hamdanal commented Jan 6, 2024

Firstly, my apologies for the lengthy PR.
I intended to fix df.groupby().resample() but it was very hard to do so given how much was missing in that area. I then went down the rabbit hole of running stubtest and fixing all allow list entries related to groupby and its methods. Hopefully the extensive testing I added and the happy stubtest will make it easier to review.

@hamdanal
Copy link
Contributor Author

hamdanal commented Jan 6, 2024

Just a note on stubtest:

Given that stubtest does not run as part of the CI, it generated 4073 errors on the main branch. With the changes in this PR, I got the number down to 3396 errors, that's almost 700 less errors. With some tweaking and organization, I got the number of allow list entries down to a manageable 840 entries. I would happily upload this allow list file and add stubtest to CI if you are willing. It will help the project stay up-to-date with upstream pandas and will prevent regressions in already fixed code.

@twoertwein twoertwein requested a review from Dr-Irv January 6, 2024 21:36
Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this giant PR!

Please wait for feedback from @Dr-Irv before making further changes.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this big contribution. It's a hard one to review, so I may not have caught everything.

Some meta-level-notes:

  1. The initial version of these stubs was produced by using stubgen (before I was involved in the project), probably using pandas 1.1. We've been slowly removing things that are not documented. So if you use stubgen to generate things from the pandas source, we end up having to do those removals again.
  2. Using the pandas source to generate stubs has risks because the types within the pandas source code are meant to make sure that the types are consistent within pandas for pandas developers, but not necessarily are they appropriate for users. So we use the stubs as a way of providing type checking for the most common usage patterns of pandas. That also means we try to only include classes, methods and functions in the stubs that are documented, as well as parameters that are documented. One advantage of this is that it has helped us identify incorrect documentation in pandas and to have discussions on whether we want to accept certain types in various methods. So the rule of thumb is that the stubs are provided for what is documented, and we tend to be as narrow as we can to help users in the best way possible.
  3. If we need some helper classes, we make them private with a preceding underscore. We're probably not consistent, but that's an overall goal. One thing I noticed is that there are classes that can be returned by pandas (and are documented as such), but users should never instantiate - so we should not include __init__() in the stubs for those classes.
  4. You included NoDefault in many places -t that is used in the pandas source to tell users there is no default value, or that if you don't supply a value, there is an assumed default within pandas. So users should never pass NoDefault, so we should not include it in the stubs.
  5. I think the setting of default values of the parameters should be moved to a separate PR so as to minimize the changes coming from this PR.
  6. I don't think we should be using @final or any of the pandas decorators in the stubs. The latter is not documented. I don't see the value for including @final in the stubs that are meant for user code (but I'm open to an alternative point of view).

One other request:
7. Can we add a test for the issue reported in #810 (testing the engine keyword)? If you just do it for the specific example provided there, that's fine - no need to create a test for all methods that accept engine .

pandas-stubs/_typing.pyi Show resolved Hide resolved
pandas-stubs/core/apply.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/base.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/base.pyi Show resolved Hide resolved
pandas-stubs/core/frame.pyi Show resolved Hide resolved
pandas-stubs/plotting/_core.pyi Outdated Show resolved Hide resolved
pandas-stubs/util/_decorators.pyi Outdated Show resolved Hide resolved
tests/test_groupby.py Outdated Show resolved Hide resolved
tests/test_groupby.py Outdated Show resolved Hide resolved
tests/test_resampler.py Outdated Show resolved Hide resolved
@hamdanal
Copy link
Contributor Author

Thank you both for the review. I’ll get back to you soon with the answers and changes.

@hamdanal hamdanal marked this pull request as draft January 11, 2024 20:20
@hamdanal hamdanal marked this pull request as ready for review January 13, 2024 14:09
@hamdanal
Copy link
Contributor Author

I think I responded to / made the required changes for all the comments above except for this one about NoDefault

I don't think we should be adding NoDefault as a possible argument in the stubs, because a user really shouldn't pass that value. It's just used internally by the pandas code to say that if a value wasn't passed, we assume a specific default.

I am afraid that we have to disagree here. I believe that the benefit of running stubtest on public methods that use NoDefault is far more important than hiding the symbol from the user. This is demonstrated in a couple of places in this PR where stubtest helped us detect wrong function signature (example there is no squeeze parameter to DataFrame.groupy). Unfortunately, because the default value at runtime is an object of type NoDefault, stubtest will complain that the type of the argument does not cover the default value if this type is to be omitted and we will have to add the whole method to the allowlist.
Additionally, the symbol NoDefault is imported from an underscored module, this is enough indication to the user that it is private. It also has a descriptive name of what it does -- it is not very inviting for users to use it. I believe if you compare the pros and cons, the symbol should be used where needed.

I could've imported it as _NoDefault (or any underscored name you like) and used this "more private" name instead but I didn't because I saw it used without the underscore in other places already. If you prefer the underscored name, maybe we can change all its uses in a separate PR?

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 14, 2024

I could've imported it as _NoDefault (or any underscored name you like) and used this "more private" name instead but I didn't because I saw it used without the underscore in other places already. If you prefer the underscored name, maybe we can change all its uses in a separate PR?

I did a search, and the only place it is used is in the parameter dtype_backend in various I/O routines.

I see your point about the value of having stubtest pick up things that are incorrect in the stubs vs. the source. So let's fix in another PR. Created an issue #851 for this.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty close. Found a few small things, and we still have some open discussions.

pandas-stubs/core/groupby/generic.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/groupby/generic.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/groupby/generic.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/groupby/ops.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/resample.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/window/__init__.pyi Outdated Show resolved Hide resolved
Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't cross-check it with the documentation/pandas code but the changes overall look amazing!

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 17, 2024

@hamdanal let me know if you want me to do another review, or if you have more commits coming

@hamdanal
Copy link
Contributor Author

@hamdanal let me know if you want me to do another review, or if you have more commits coming

I addressed your previous comments. Feel free to do another review.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks fine codewise, but there is a change to pyproject.toml that I don't think should be there.

pyproject.toml Outdated Show resolved Hide resolved
@hamdanal hamdanal marked this pull request as draft February 4, 2024 19:18
@hamdanal hamdanal marked this pull request as ready for review February 4, 2024 19:30
@hamdanal
Copy link
Contributor Author

hamdanal commented Feb 4, 2024

I reverted the pyproject.toml change and fixed some new warnings in the tests. I also pinned pyright since it changed some of its error codes and started failing in places unrelated to this PR.

@@ -252,7 +252,7 @@ class DataFrameGroupBy(GroupBy[DataFrame], Generic[ByT]):
) -> DataFrame: ...
@overload
def boxplot(
grouped,
grouped, # pyright: ignore[reportSelfClsParameterName]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use self here (and in the other overloads)?

pandas-stubs/core/groupby/generic.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/groupby/generic.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/groupby/generic.pyi Outdated Show resolved Hide resolved
Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hamdanal ! Lots of good work here

@Dr-Irv Dr-Irv merged commit e35c3ca into pandas-dev:main Feb 6, 2024
13 checks passed
@hamdanal hamdanal deleted the groupby branch February 6, 2024 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants