-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] df.apply
: add support for engine='bodo'
#60622
base: main
Are you sure you want to change the base?
[WIP] df.apply
: add support for engine='bodo'
#60622
Conversation
Is there an issue open to discuss this? Adding another engine internally is a non-trivial amount of work and maintenance, so I'm not sure we would even add this without discussion up front. |
@WillAyd Hi Will, thank you for the feedback, I would be happy to open an issue and start a conversation. The intention of this draft was mainly experimentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me. I added some comments with ideas, but just minor things.
It'd still be good to have an issue for this (or maybe just update the description of the PR), so everybody understands in detail why this is being proposed, and what use cases this addresses. But in general, this seems good.
pandas/core/apply.py
Outdated
@@ -870,13 +870,16 @@ def apply(self) -> DataFrame | Series: | |||
"the 'numba' engine doesn't support using " | |||
"a string as the callable function" | |||
) | |||
if self.engine == "bodo": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.engine == "bodo": | |
elif self.engine == "bodo": |
pandas/core/apply.py
Outdated
results, res_index = self.apply_series_numba() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else: | |
elif self.engine == "else": |
I know what you wrote is consistent with what we have now, where numba
is the default for the if
, but I think it's clearer to avoid a default (or make python
the default). Maybe raise a ValueError
if self.engine
isn't known?
pandas/core/apply.py
Outdated
@@ -1089,6 +1098,26 @@ def apply_series_numba(self): | |||
results = self.apply_with_numba() | |||
return results, self.result_index | |||
|
|||
def apply_series_bodo(self) -> DataFrame | Series: | |||
bodo = import_optional_dependency("bodo") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if importing bodo is immediate, but maybe better to have the checks that raise the errors first, and only import if those pass?
@@ -10288,6 +10288,8 @@ def apply( | |||
<https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html>`_ | |||
in numba to learn what you can or cannot use in the passed function. | |||
|
|||
TODO: describe bodo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this comment to the review, so the TODO is not forgotten
frame = pd.DataFrame( | ||
{"a": [1, 2, 3], "b": [4, 5, 6], "c": [7.0, 8.0, 9.0]}, | ||
) | ||
f = lambda x: x["c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why our linter doesn't complain, but I think it's considered a bad practice to use lambda when assigning to a variable. I'd use def
instead.
@@ -3,6 +3,7 @@ | |||
name: pandas-dev | |||
channels: | |||
- conda-forge | |||
- bodo.ai |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have plans to package Bodo for conda-forge? Is there a reason not to? I think it'd be better for users and our CI if we could simply use conda-forge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we currently have an in progress PR here: conda-forge/staged-recipes#28648
@@ -0,0 +1,107 @@ | |||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than creating separate tests for bodo is there a way that we can create a fixture for the three different engines? Structuring the tests that way would be very helpful to ensure result consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this would be ideal. Part of the reason I structured it this way was that it made it easier to test bodo engine specific behavior e.g. error messages and to match tests for the numba engine. Though I did notice there was also already an engine fixture in test_frame_apply.py
. I tried adding bodo to this fixture but found that it changed the behavior numba engine tests due to bodo installing some numba extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK thanks for the explanation. So do you think those behavioral changes are a bug with bodo or Numba?
W.r.t. the error messages, it would be nice if we catch engine specific errors and offer our own abstraction on top of this. You can see our existing errors in the pandas.errors module - some of those might be useable for issues here, or we may need to create a ExecutionError
to abstract engine failures
Adds support for specifying engine="bodo" when executing UDF's on DataFrames via
df.apply
.doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.