ENH: Change default dtype of str.get_dummies() to bool for consistency with pd.get_dummies() #60676

Open
1 of 3 tasks
komo-fr opened this issue Jan 8, 2025 · 0 comments
Labels: Enhancement, Needs Triage

komo-fr commented Jan 8, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.
To align the behavior, I propose changing the default dtype of str.get_dummies() to a boolean type (bool or boolean), matching the current behavior of pd.get_dummies().

Current Behavior of pd.get_dummies() vs. str.get_dummies()

pd.get_dummies():

import pandas as pd
sr = pd.Series(["A", "B", "A"])
pd.get_dummies(sr)

       A      B
0   True  False
1  False   True
2   True  False

str.get_dummies():

sr.str.get_dummies()

   A  B
0  1  0
1  0  1
2  1  0

Note: The behavior described here is consistent in both pandas 2.2.3 and the current main branch (3.0.0.dev).
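
Until the default changes, the integer result can be cast explicitly. The following is a minimal sketch of that workaround, not part of the proposal itself; it assumes a plain string Series and uses only DataFrame.astype():

```python
import pandas as pd

sr = pd.Series(["A", "B", "A"])

# str.get_dummies() yields integer indicator columns by default;
# casting afterwards produces boolean columns instead.
str_dummies = sr.str.get_dummies().astype(bool)
pd_dummies = pd.get_dummies(sr)

print(str_dummies.dtypes)
print(pd_dummies.dtypes)
```

After the cast, both frames hold the same indicator values with a boolean dtype, which is the default behavior this issue asks for.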

Background

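pd.get_dummies() has returned a boolean dtype by default since pandas 2.0 (the previous default was np.uint8). However, str.get_dummies() has not yet been updated to match this behavior; it still defaults to np.int64.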

Detailed Comparison by Dtype

The following compares the results of pd.get_dummies() and str.get_dummies() on Series with various dtypes. The results match only for pd.ArrowDtype(pa.string()) (case 4, displayed as string[pyarrow]); in all other cases, including pd.StringDtype("pyarrow") (case 2), the output dtypes differ.

Code:

import numpy as np
import pandas as pd
import pyarrow as pa

values = ["A", "B", "A"]
# Categories backed by an Arrow string dtype (used in the last case).
arrow_categories = pd.Index(["A", "B"], dtype=pd.ArrowDtype(pa.string()))

# One Series per string-like dtype to compare.
sr_list = [
    pd.Series(values),
    pd.Series(values, dtype=pd.StringDtype()),
    pd.Series(values, dtype=pd.StringDtype("pyarrow")),
    pd.Series(values, dtype=pd.StringDtype("pyarrow_numpy")),
    pd.Series(values, dtype=pd.ArrowDtype(pa.string())),
    pd.Series(values, dtype="category"),
    pd.Series(values, dtype=pd.CategoricalDtype(arrow_categories)),
]

# Print the dtype of the "A" indicator column produced by each method.
for i, sr in enumerate(sr_list):
    print(f"----- case {i}. {sr.dtype=} -----")
    print(f"pd.get_dummies: {pd.get_dummies(sr)['A'].dtype}")
    print(f"str.get_dummies: {sr.str.get_dummies()['A'].dtype}")

Output:

----- case 0. sr.dtype=dtype('O') -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 1. sr.dtype=string[python] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 2. sr.dtype=string[pyarrow] -----
pd.get_dummies: boolean
str.get_dummies: int64
----- case 3. sr.dtype=string[pyarrow_numpy] -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 4. sr.dtype=string[pyarrow] -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: bool[pyarrow]
----- case 5. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object) -----
pd.get_dummies: bool
str.get_dummies: int64
----- case 6. sr.dtype=CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=string[pyarrow]) -----
pd.get_dummies: bool[pyarrow]
str.get_dummies: int64

Feature Description

1. Modify the default dtype in str.get_dummies() to return boolean values instead of np.int64.

Currently, str.get_dummies() sets the default dtype to np.int64 in the following locations. These can be updated to use np.bool_ for consistency:

2. If the input has an extension dtype, the method should return the corresponding boolean dtype (e.g., bool[pyarrow] for Arrow-backed input, boolean for a nullable string dtype).

The pd.get_dummies() method already determines the output dtype based on the input Series dtype using the following logic. The same approach can be adapted for str.get_dummies():
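
The linked pandas source is not reproduced here. As a rough illustration only, a dtype-selection rule of this kind might look like the sketch below. `infer_dummy_dtype` is a hypothetical helper name, not a pandas function, and its branching is an assumption modeled on the pd.get_dummies() outputs listed in the comparison above:

```python
import numpy as np
import pandas as pd


def infer_dummy_dtype(input_dtype):
    """Hypothetical helper: choose a boolean result dtype for an input dtype."""
    # For categoricals, decide based on the dtype of the categories.
    if isinstance(input_dtype, pd.CategoricalDtype):
        input_dtype = input_dtype.categories.dtype

    if isinstance(input_dtype, pd.ArrowDtype):
        import pyarrow as pa  # only needed for Arrow-backed input

        return pd.ArrowDtype(pa.bool_())
    # Nullable string dtypes (missing value is pd.NA) map to nullable boolean.
    if isinstance(input_dtype, pd.StringDtype) and input_dtype.na_value is pd.NA:
        return pd.BooleanDtype()
    # Everything else falls back to the NumPy bool dtype.
    return np.dtype(bool)


# Usage: cast the current str.get_dummies() result to the inferred dtype.
sr = pd.Series(["A", "B", "A"], dtype=pd.StringDtype("python"))
dummies = sr.str.get_dummies().astype(infer_dummy_dtype(sr.dtype))
print(dummies.dtypes)
```

Resolving the dtype in one place would let str.get_dummies() keep an explicit dtype argument as an override while changing only the default.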

Related PR:
I previously submitted a PR for this change, but the tests did not pass, and the PR has remained inactive since then. I am now opening this issue to clarify the problem and discuss potential solutions before proceeding with further modifications.

Alternative Solutions

No alternative solutions have been identified.

Additional Context
