Applying User Defined Functions (UDFs) to a DataFrame can be very slow when evaluated using the default Python engine. Passing engine="numba" and leveraging Numba's Just-in-Time (JIT) compiler to transform the UDF application into an optimized binary can improve performance. However, the Numba UDF engine has several limitations, including:
Limited set of supported dtypes (only NumPy dtypes; ExtensionDtypes are not supported)
Parallel execution is not supported (unless raw=True)
Difficulty troubleshooting issues due to lengthy stack traces and hard-to-read error messages
Adding support for the Bodo engine would solve the above issues and provide a good complement to the capabilities of the currently supported engines (Python and Numba).
Bodo uses an auto-parallelizing JIT compiler to transform Python code into highly optimized, parallel binaries with an MPI backend, allowing it to scale to very large data sizes with minimal extra work required from the user (large speedups on both laptops and clusters). Bodo is also built for Pandas and supports DataFrame, Series and Array Extension types natively.
@jbrockmendel while in general I agree with your point, we already have the numba engine in pandas. Do you think we should remove it?
From what I understand, it seems like Bodo should be better for most users, as it works with Arrow types (besides the other advantages discussed). So, while I'm a big fan of not adding more things into pandas, I don't see why we should have numba and not bodo. Is there a reason?
Also, assuming Bodo is included, are there reasons to keep numba? When is numba a better choice than Bodo?
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Feature Description
Allow passing the value "bodo" to the engine parameter in DataFrame.apply and add an apply_bodo method which accepts the user defined function, creates a JIT function to do the apply, and calls it. For example, in pandas/core/apply.py:
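A minimal sketch of how such a dispatch might look, under stated assumptions: FrameApplySketch, apply_python and do_apply are hypothetical names, the class is a toy stand-in rather than pandas' actual apply machinery, and whether bodo.jit can compile an arbitrary UDF passed as an argument is not verified here. Only the engine names and the apply_bodo method name come from the proposal.

```python
class FrameApplySketch:
    """Toy stand-in for the apply machinery (hypothetical, not pandas code)."""

    def __init__(self, obj, func, axis=0, engine="python"):
        self.obj = obj        # any object exposing .apply(func, axis=...)
        self.func = func      # the user defined function
        self.axis = axis
        self.engine = engine

    def apply(self):
        # Dispatch on the engine name chosen by the caller.
        if self.engine == "python":
            return self.apply_python()
        if self.engine == "bodo":
            return self.apply_bodo()
        raise ValueError(f"Unknown engine: {self.engine!r}")

    def apply_python(self):
        # Default path: plain interpreted apply.
        return self.obj.apply(self.func, axis=self.axis)

    def apply_bodo(self):
        # bodo is treated as an optional dependency and imported lazily.
        try:
            import bodo
        except ImportError as err:
            raise ImportError(
                "engine='bodo' requires the optional bodo package"
            ) from err

        # Assumption: wrapping the apply call in a JIT function lets Bodo
        # compile it together with the UDF.
        @bodo.jit
        def do_apply(obj, func, axis):
            return obj.apply(func, axis=axis)

        return do_apply(self.obj, self.func, self.axis)
```

Importing bodo lazily inside apply_bodo keeps the default Python path free of the extra dependency.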
This approach could also be applied to other APIs that accept a UDF and an engine argument.
Alternative Solutions
Users could execute their UDF using a Bodo JIT'd function. For example:
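A minimal sketch of this workaround, assuming bodo.jit can compile a wrapper around DataFrame.apply. The UDF f, the wrapper apply_udf, and the columns A and B are hypothetical, and Bodo may additionally require the UDF itself to be JIT-compilable. The import is guarded so the snippet still runs, uncompiled, where bodo is not installed.

```python
import importlib.util


def f(row):
    # Hypothetical UDF: integer-divide column A by column B, guarding against zero.
    return row["A"] // row["B"] if row["B"] != 0 else 0


def apply_udf(df, func):
    # The user wraps the apply call in a function of their own ...
    return df.apply(func, axis=1)


# ... and compiles that wrapper with Bodo when the package is available.
if importlib.util.find_spec("bodo") is not None:
    import bodo

    apply_udf = bodo.jit(apply_udf)
```

With pandas and bodo installed, apply_udf(df, f) would then be called on a DataFrame that has columns A and B.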
While this approach is fine, it has its downsides, such as requiring a larger code rewrite, which could make it more difficult to quickly experiment with different engines.
Additional Context
Relevant links:
Bodo's documentation
Bodo's github repo
Proof-of-concept PR that adds support for engine="bodo" in df.apply.