ENH: Change default dtype of str.get_dummies() to bool for consistency with pd.get_dummies()
#60676
Open
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Currently, pd.get_dummies() returns a boolean dtype by default, while str.get_dummies() returns an integer dtype (np.int64). This inconsistency may cause confusion for users.
To align the behavior, I propose changing the default dtype of str.get_dummies() to a boolean type (bool or boolean), matching the current behavior of pd.get_dummies().
Current Behavior of pd.get_dummies() vs. str.get_dummies()
pd.get_dummies():
str.get_dummies():
Note: The behavior described here is consistent in both pandas 2.2.3 and the current main branch (3.0.0.dev).
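A minimal sketch of the difference between the two defaults. The sample Series is illustrative and not taken from the issue; object dtype is requested explicitly so the result does not depend on which string dtype pandas infers.

```python
import pandas as pd

# Illustrative data (an assumption, not from the issue).
s = pd.Series(["a", "b", "a"], dtype=object)

top_level = pd.get_dummies(s)    # top-level function
accessor = s.str.get_dummies()   # string accessor method

print(top_level.dtypes)  # bool columns by default (pandas >= 2.0)
print(accessor.dtypes)   # int64 on pandas 2.2.3, per the description above

# An explicit cast yields boolean columns regardless of the accessor's default.
accessor_bool = accessor.astype(bool)
print(accessor_bool.dtypes)
```

The explicit astype(bool) call is one way to get matching output today, independent of what the accessor returns by default.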
Background
The default dtype of pd.get_dummies() was changed to bool.
pd.get_dummies() was updated to return a corresponding boolean dtype (e.g., bool[pyarrow]) when the input is an ExtensionArray.
However, str.get_dummies() has not yet been updated to match this behavior. Currently, it still defaults to np.int64.
Detailed Comparison by Dtype
The following compares the results of using pd.get_dummies() and str.get_dummies() on Series with various dtypes. Only when the dtype is string[pyarrow] do the results match, while in all other cases, the output types differ.
Code:
Output:
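One way such a per-dtype comparison could be run is sketched below. The sample values and the choice of dtypes are assumptions, and string[pyarrow] is only included when the optional pyarrow dependency is installed. No expected output is shown because the result depends on the pandas version.

```python
import pandas as pd

values = ["a", "b", "a"]  # illustrative data, not from the issue

# Input dtypes to compare; string[pyarrow] needs the optional pyarrow package.
dtypes = ["object", "string[python]"]
try:
    import pyarrow  # noqa: F401
    dtypes.append("string[pyarrow]")
except ImportError:
    pass

rows = []
for dt in dtypes:
    s = pd.Series(values, dtype=dt)
    rows.append(
        {
            "input dtype": dt,
            # Distinct column dtypes produced by each API, as strings.
            "pd.get_dummies": sorted({str(d) for d in pd.get_dummies(s).dtypes}),
            "str.get_dummies": sorted({str(d) for d in s.str.get_dummies().dtypes}),
        }
    )

summary = pd.DataFrame(rows)
print(summary.to_string(index=False))
```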
Feature Description
1. Modify the default dtype in str.get_dummies() to return boolean values instead of np.int64.
Currently, str.get_dummies() sets the default dtype to np.int64 in the following locations. These can be updated to use np.bool_ for consistency:
_str_get_dummies()
_str_get_dummies()
_str_get_dummies() (already using np.bool_)
2. If the input is an ExtensionArray, the method should return the corresponding boolean dtype (e.g., bool[pyarrow]).
The pd.get_dummies() method already determines the output dtype based on the input Series dtype using the following logic. The same approach can be adapted for str.get_dummies():
_get_dummies_1d()
Related PR:
I previously submitted a PR for this change, but the tests did not pass, and the PR has remained inactive since then. I am now opening this issue to clarify the problem and discuss potential solutions before proceeding with further modifications.
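The dtype selection described in item 2 could be sketched roughly as follows. This is not the pandas implementation: infer_dummies_dtype is a hypothetical helper written for illustration, and the mapping it applies (Arrow input to Arrow bool, NA-aware string input to the nullable boolean dtype, everything else to NumPy bool) is an assumption about how the rule might look.

```python
import numpy as np
import pandas as pd


def infer_dummies_dtype(input_dtype):
    """Hypothetical helper: pick a boolean dtype that matches the input dtype."""
    if isinstance(input_dtype, pd.ArrowDtype):
        # Arrow-backed input -> Arrow boolean (requires pyarrow).
        import pyarrow as pa

        return pd.ArrowDtype(pa.bool_())
    if isinstance(input_dtype, pd.StringDtype) and input_dtype.na_value is pd.NA:
        # NA-aware string input -> nullable "boolean" extension dtype.
        return pd.BooleanDtype()
    # NumPy-backed input (e.g. object) -> plain NumPy bool.
    return np.dtype(bool)


# Usage: apply the inferred dtype to the accessor's output with an explicit cast.
s = pd.Series(["a|b", "b"], dtype="string[python]")
target = infer_dummies_dtype(s.dtype)
dummies = s.str.get_dummies().astype(target)
print(dummies.dtypes)
```

Casting after the fact keeps the sketch independent of the accessor's current default; a change inside pandas would instead apply the inferred dtype where the dummy matrix is built.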
Alternative Solutions
No alternative solutions have been identified.
Additional Context
pd.get_dummies() behavior: pd.get_dummies should return bool[pyarrow] types #56273