Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] df.apply: add support for engine='bodo' #60622

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

scott-routledge2
Copy link

@scott-routledge2 scott-routledge2 commented Dec 30, 2024

Adds support for specifying engine="bodo" when executing UDF's on DataFrames via df.apply.

@WillAyd
Copy link
Member

WillAyd commented Dec 30, 2024

Is there an issue open to discuss this? Adding another engine internally is a non-trivial amount of work and maintenance, so I'm not sure we would even add this without discussion up front.

@scott-routledge2
Copy link
Author

Is there an issue open to discuss this? Adding another engine internally is a non-trivial amount of work and maintenance, so I'm not sure we would even add this without discussion up front.

@WillAyd Hi Will, thank you for the feedback, I would be happy to open an issue and start a conversation. The intention of this draft was mainly experimentation.

@WillAyd WillAyd added the Needs Discussion Requires discussion from core team before further action label Dec 31, 2024
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. I added some comments with ideas, but just minor things.

It'd still be good to have an issue for this (or maybe just update the description of the PR), so everybody understands in detail why this is being proposed, and what use cases this addresses. But in general, this seems good.

@@ -870,13 +870,16 @@ def apply(self) -> DataFrame | Series:
"the 'numba' engine doesn't support using "
"a string as the callable function"
)
if self.engine == "bodo":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self.engine == "bodo":
elif self.engine == "bodo":

results, res_index = self.apply_series_numba()
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else:
elif self.engine == "else":

I know what you wrote is consistent with what we have now, where numba is the default for the if, but I think it's clearer to avoid a default (or make python the default). Maybe raise a ValueError if self.engine isn't known?

@@ -1089,6 +1098,26 @@ def apply_series_numba(self):
results = self.apply_with_numba()
return results, self.result_index

def apply_series_bodo(self) -> DataFrame | Series:
bodo = import_optional_dependency("bodo")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if importing bodo is immediate, but maybe better to have the checks that raise the errors first, and only import if those pass?

@@ -10288,6 +10288,8 @@ def apply(
<https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html>`_
in numba to learn what you can or cannot use in the passed function.

TODO: describe bodo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this comment to the review, so the TODO is not forgotten

frame = pd.DataFrame(
{"a": [1, 2, 3], "b": [4, 5, 6], "c": [7.0, 8.0, 9.0]},
)
f = lambda x: x["c"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why our linter doesn't complain, but I think it's considered a bad practice to use lambda when assigning to a variable. I'd use def instead.

@@ -3,6 +3,7 @@
name: pandas-dev
channels:
- conda-forge
- bodo.ai
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have plans to package Bodo for conda-forge? Is there a reason not to? I think it'd be better for users and our CI if we could simply use conda-forge.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we currently have an in progress PR here: conda-forge/staged-recipes#28648

@@ -0,0 +1,107 @@
import numpy as np
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating separate tests for bodo is there a way that we can create a fixture for the three different engines? Structuring the tests that way would be very helpful to ensure result consistency

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this would be ideal. Part of the reason I structured it this way was that it made it easier to test bodo engine specific behavior e.g. error messages and to match tests for the numba engine. Though I did notice there was also already an engine fixture in test_frame_apply.py. I tried adding bodo to this fixture but found that it changed the behavior numba engine tests due to bodo installing some numba extensions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK thanks for the explanation. So do you think those behavioral changes are a bug with bodo or Numba?

W.r.t. the error messages, it would be nice if we catch engine specific errors and offer our own abstraction on top of this. You can see our existing errors in the pandas.errors module - some of those might be useable for issues here, or we may need to create a ExecutionError to abstract engine failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add support for executing UDF's using Bodo as the engine
3 participants