Add public APIs to Access Underlying cudf and pandas Objects from `cudf.pandas` Proxy Objects (#17629)

Fixes: #17524 
Fixes: rapidsai/cuml#6232
This PR introduces methods to access the real underlying `cudf` and `pandas` objects from `cudf.pandas` proxy objects. These methods ensure compatibility with libraries that are `cudf` or `pandas` aware.


This PR also speeds up `cudf.pandas` workflows. Timings from the script posted in rapidsai/cuml#6232 show roughly a 21x speedup:

`branch-25.02`:
```
cuML Label Encoder with cuDF-Pandas took 2.00794 seconds
```
`This PR`:
```
cuML Label Encoder with cuDF-Pandas took 0.09284 seconds
```


Changes:

- [x] Added `as_gpu_object()` and `as_cpu_object()` methods.
- [x] Updated faq.md with a section explaining how to use these methods.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #17629
galipremsagar authored Jan 29, 2025
1 parent 367405f commit ed2f3c3
Showing 6 changed files with 107 additions and 4 deletions.
47 changes: 47 additions & 0 deletions docs/cudf/source/cudf_pandas/faq.md
@@ -142,6 +142,53 @@ cuDF (learn more in [this
blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
provides a similar configuration-based plugin for Spark.

## How do I know if an object is a `cudf.pandas` proxy object?

To determine if an object is a `cudf.pandas` proxy object, you can use the `isinstance_cudf_pandas` API. This function checks if the given object is a proxy object that wraps either a `cudf` or `pandas` object. Here is an example of how to use this API:

```python
from cudf.pandas import isinstance_cudf_pandas

obj = ... # Your object here
if isinstance_cudf_pandas(obj, pd.Series):
    print("The object is a cudf.pandas proxy Series object.")
else:
    print("The object is not a cudf.pandas proxy Series object.")
```

To detect `Series`, `DataFrame`, `Index`, and `ndarray` objects separately, you can pass the type names as the second parameter:

* `isinstance_cudf_pandas(obj, pd.Series)`: Detects if the object is a `cudf.pandas` proxy `Series`.
* `isinstance_cudf_pandas(obj, pd.DataFrame)`: Detects if the object is a `cudf.pandas` proxy `DataFrame`.
* `isinstance_cudf_pandas(obj, pd.Index)`: Detects if the object is a `cudf.pandas` proxy `Index`.
* `isinstance_cudf_pandas(obj, np.ndarray)`: Detects if the object is a `cudf.pandas` proxy `ndarray`.
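
The implementation in this commit combines `is_proxy_object` with a comparison of class names, so a proxy matches only the type whose name it shares. A minimal sketch of that rule follows. `_ProxyBase`, `real`, and `proxied` are hypothetical stand-ins so the snippet runs without `cudf` installed, and the stand-in `is_proxy_object` is an assumption, not the real implementation.

```python
class _ProxyBase:
    """Hypothetical marker base standing in for cudf.pandas proxy types."""


def is_proxy_object(obj):
    # Stand-in for cudf.pandas.is_proxy_object (assumption for this sketch).
    return isinstance(obj, _ProxyBase)


def isinstance_cudf_pandas(obj, type_):
    # Same rule as the commit: must be a proxy AND share the class name.
    return is_proxy_object(obj) and obj.__class__.__name__ == type_.__name__


class real:
    """Stand-in namespace playing the role of plain pandas."""

    class Series:
        pass

    class DataFrame:
        pass


class proxied:
    """Stand-in namespace playing the role of cudf.pandas proxy types."""

    class Series(_ProxyBase):
        pass


proxy_matches_series = isinstance_cudf_pandas(proxied.Series(), real.Series)
proxy_matches_frame = isinstance_cudf_pandas(proxied.Series(), real.DataFrame)
plain_matches_series = isinstance_cudf_pandas(real.Series(), real.Series)
```

In this sketch a proxy `Series` matches `real.Series` but not `real.DataFrame`, and a plain (non-proxy) `Series` never matches. Because the comparison is by name rather than by inheritance, subclasses with a different class name would not match.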

## How can I access the underlying GPU or CPU objects?

When working with `cudf.pandas` proxy objects, it is sometimes necessary to get true `cudf` or `pandas` objects that reside on GPU or CPU.
For example, this can be used to ensure that GPU-aware libraries that support both `cudf` and `pandas` can use the `cudf`-optimized code paths that keep data on GPU when processing `cudf.pandas` objects.
Otherwise, the library might use less-optimized CPU code because it thinks that the `cudf.pandas` object is a plain `pandas` dataframe.

The following methods can be used to retrieve the actual `cudf` or `pandas` objects:

- `as_gpu_object()`: This method returns the `cudf` object from the proxy.
- `as_cpu_object()`: This method returns the `pandas` object from the proxy.

If `as_gpu_object()` is called on a proxy array, it will return a `cupy` array and `as_cpu_object` will return a `numpy` array.

Here is an example of how to use these methods:

```python
# Assuming `proxy_obj` is a cudf.pandas proxy object
cudf_obj = proxy_obj.as_gpu_object()
pandas_obj = proxy_obj.as_cpu_object()

# Now you can use `cudf_obj` and `pandas_obj` with libraries that are cudf or pandas aware
```

Be aware that if `cudf.pandas` objects are converted to their underlying `cudf` or `pandas` types, the `cudf.pandas` proxy no longer controls them.
This means that automatic conversion between GPU and CPU types and automatic fallback from GPU to CPU functionality will not occur.
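
One possible way for a library to use these methods is a small helper that unwraps proxies and passes every other object through unchanged. The sketch below rests on assumptions: `unwrap` is a hypothetical helper that is not part of `cudf.pandas`, and `FakeProxy` is a stand-in so the snippet runs without a GPU.

```python
def unwrap(obj, prefer_gpu=True):
    """Return the object underlying a proxy, or obj itself if not a proxy.

    Hypothetical helper: it relies only on the as_gpu_object() and
    as_cpu_object() methods described above, detected by duck typing.
    """
    method_name = "as_gpu_object" if prefer_gpu else "as_cpu_object"
    method = getattr(obj, method_name, None)
    return method() if callable(method) else obj


class FakeProxy:
    """Stand-in for a cudf.pandas proxy object (assumption for this sketch)."""

    def as_gpu_object(self):
        return "cudf object"

    def as_cpu_object(self):
        return "pandas object"


gpu_result = unwrap(FakeProxy())
cpu_result = unwrap(FakeProxy(), prefer_gpu=False)
plain_result = unwrap([1, 2, 3])  # non-proxy input is returned unchanged
```

With a real proxy, the returned object is outside the proxy's control, so the caveat above about losing automatic fallback applies to whatever `unwrap` returns.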

(are-there-any-known-limitations)=
## Are there any known limitations?

2 changes: 1 addition & 1 deletion python/cudf/cudf/core/column/column.py
@@ -1251,7 +1251,7 @@ def as_categorical_column(self, dtype) -> ColumnBase:
)

# Categories must be unique and sorted in ascending order.
-cats = self.unique().sort_values().astype(self.dtype)
+cats = self.unique().sort_values()
label_dtype = min_unsigned_type(len(cats))
labels = self._label_encoding(
cats=cats, dtype=label_dtype, na_sentinel=cudf.Scalar(1)
3 changes: 2 additions & 1 deletion python/cudf/cudf/pandas/__init__.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

@@ -8,6 +8,7 @@
import pylibcudf
import rmm.mr

from ._wrappers.pandas import isinstance_cudf_pandas
from .fast_slow_proxy import is_proxy_object
from .magics import load_ipython_extension
from .profiler import Profiler
15 changes: 14 additions & 1 deletion python/cudf/cudf/pandas/_wrappers/pandas.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import abc
@@ -35,7 +35,9 @@
_fast_slow_function_call,
_FastSlowAttribute,
_FunctionProxy,
_maybe_wrap_result,
_Unusable,
is_proxy_object,
make_final_proxy_type as _make_final_proxy_type,
make_intermediate_proxy_type as _make_intermediate_proxy_type,
register_proxy_func,
@@ -266,6 +268,12 @@ def custom_repr_html(obj):
html_formatter.for_type(DataFrame, custom_repr_html)


def _Series_dtype(self):
    # Fast-path to extract dtype from the current
    # object without round-tripping through the slow<->fast
    return _maybe_wrap_result(self._fsproxy_wrapped.dtype, None)


Series = make_final_proxy_type(
"Series",
cudf.Series,
@@ -285,6 +293,7 @@ def custom_repr_html(obj):
"_constructor": _FastSlowAttribute("_constructor"),
"_constructor_expanddim": _FastSlowAttribute("_constructor_expanddim"),
"_accessors": set(),
"dtype": _Series_dtype,
},
)

@@ -1704,6 +1713,10 @@ def holiday_calendar_factory_wrapper(*args, **kwargs):
)


def isinstance_cudf_pandas(obj, type):
    return is_proxy_object(obj) and obj.__class__.__name__ == type.__name__


# timestamps and timedeltas are not proxied, but non-proxied
# pandas types are currently not picklable. Thus, we define
# custom reducer/unpicker functions for these types:
10 changes: 9 additions & 1 deletion python/cudf/cudf/pandas/fast_slow_proxy.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

@@ -204,6 +204,12 @@ def _fsproxy_fast_to_slow(self):
return fast_to_slow(self._fsproxy_wrapped)
return self._fsproxy_wrapped

def as_gpu_object(self):
    return self._fsproxy_slow_to_fast()

def as_cpu_object(self):
    return self._fsproxy_fast_to_slow()

@property # type: ignore
def _fsproxy_state(self) -> _State:
return (
@@ -221,6 +227,8 @@ def _fsproxy_state(self) -> _State:
"_fsproxy_slow_type": slow_type,
"_fsproxy_slow_to_fast": _fsproxy_slow_to_fast,
"_fsproxy_fast_to_slow": _fsproxy_fast_to_slow,
"as_gpu_object": as_gpu_object,
"as_cpu_object": as_cpu_object,
"_fsproxy_state": _fsproxy_state,
}

34 changes: 34 additions & 0 deletions python/cudf/cudf_pandas_tests/test_cudf_pandas.py
@@ -65,6 +65,10 @@
get_calendar,
)

from cudf.pandas import (
isinstance_cudf_pandas,
)

# Accelerated pandas has the real pandas and cudf modules as attributes
pd = xpd._fsproxy_slow
cudf = xpd._fsproxy_fast
@@ -1885,3 +1889,33 @@ def test_dataframe_setitem():
new_df = df + 1
df[df.columns] = new_df
tm.assert_equal(df, new_df)


def test_dataframe_get_fast_slow_methods():
    df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
    assert isinstance(df.as_gpu_object(), cudf.DataFrame)
    assert isinstance(df.as_cpu_object(), pd.DataFrame)


def test_is_cudf_pandas():
    s = xpd.Series([1, 2, 3])
    df = xpd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
    index = xpd.Index([1, 2, 3])

    assert isinstance_cudf_pandas(s, pd.Series)
    assert isinstance_cudf_pandas(df, pd.DataFrame)
    assert isinstance_cudf_pandas(index, pd.Index)
    assert isinstance_cudf_pandas(index.values, np.ndarray)

    for obj in [s, df, index, index.values]:
        assert not isinstance_cudf_pandas(obj._fsproxy_slow, pd.Series)
        assert not isinstance_cudf_pandas(obj._fsproxy_fast, pd.Series)

        assert not isinstance_cudf_pandas(obj._fsproxy_slow, pd.DataFrame)
        assert not isinstance_cudf_pandas(obj._fsproxy_fast, pd.DataFrame)

        assert not isinstance_cudf_pandas(obj._fsproxy_slow, pd.Index)
        assert not isinstance_cudf_pandas(obj._fsproxy_fast, pd.Index)

        assert not isinstance_cudf_pandas(obj._fsproxy_slow, np.ndarray)
        assert not isinstance_cudf_pandas(obj._fsproxy_fast, np.ndarray)
