BUG: DataFrame.sort_values() by 2 columns and a key function produces incorrect results #60673
Comments
I don't think this is a bug.
When you say [...] In this example, you start with the Series [...] Let's say the key function instead returned [...] This is similar to how Python's sorted() works:
data = list("DCBA")
def key(val):
    # Equivalent to returning "ABCD" in pandas
    return {"D": "A", "C": "B", "B": "C", "A": "D"}[val]
print(sorted(data, key=key))
# ['D', 'C', 'B', 'A']
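The same behavior can be reproduced in pandas itself. The snippet below is a minimal sketch, not code from the original report: the Series `s` and the `mapping` dict are illustrative stand-ins that mirror the sorted() example above.

```python
import pandas as pd

# Illustrative data mirroring the sorted() example; not from the original report.
s = pd.Series(list("DCBA"))
mapping = {"D": "A", "C": "B", "B": "C", "A": "D"}

def key(col: pd.Series) -> pd.Series:
    # Vectorized key: receives the whole Series and returns a
    # same-length Series whose values are used as the sort keys.
    return col.map(mapping)

result = s.sort_values(key=key)
print(result.tolist())
```

The key maps the values to "A", "B", "C", "D", which are already in ascending order, so the original values keep their positions.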
@rhshadrach OK, I see your point and now understand how it is supposed to work. I think the docs aren't necessarily clear. For the key argument, maybe it should say:
Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently. The values in the returned Series will be used as the keys for sorting.
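To make "applied to each column in by independently" concrete, here is a small sketch. The DataFrame contents and the helper name `normalize` are assumptions made up for illustration; only `sort_values`, its `by` and `key` parameters, and the column names `let` and `num` come from the discussion.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data; the original reproducible example is not available here.
df = pd.DataFrame({"let": ["b", "A", "a", "B"], "num": [2, 2, 1, 1]})

seen = []  # records which columns the key function was called with

def normalize(col: pd.Series) -> pd.Series:
    # Called once per column listed in `by`, each time with a single Series.
    seen.append(col.name)
    if is_numeric_dtype(col):
        return col            # leave numeric columns unchanged
    return col.str.lower()    # compare strings case-insensitively

result = df.sort_values(by=["let", "num"], key=normalize)
print(result)
print(seen)
```

The function never sees both columns at once, so it cannot express an ordering that depends on a combination of `let` and `num`; it can only transform each column's values before the usual hierarchical sort.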
Yep, the answer above is correct. As stated, it looks like it's executed column-wise. The key function you'd be looking for would be [...]
If you're looking to perform some kind of special sorting operation, it might be useful to create a third column in this case and apply some kind of hashing function.
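One way to read the "third column" suggestion is sketched below. Everything specific here is an assumption for illustration: the data, the `weights` mapping, and the temporary column name `_sort_key` are hypothetical, and a simple arithmetic combination stands in for the "hashing function" mentioned above.

```python
import pandas as pd

# Hypothetical data and weights, chosen only to illustrate the idea.
df = pd.DataFrame({"let": ["a", "b", "a", "b"], "num": [3, 1, 2, 4]})
weights = {"a": 10, "b": 1}

# Build a helper column that depends on BOTH columns at once,
# which a per-column key function cannot do.
helper = df["let"].map(weights) * df["num"]

result = (
    df.assign(_sort_key=helper)   # add the temporary sort column
      .sort_values("_sort_key")   # sort on the combined value
      .drop(columns="_sort_key")  # remove it again
)
print(result)
```

Because the helper column is computed from the whole frame, the ordering can mix information from several columns, and the original columns are left untouched.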
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When providing a key argument to sort_values() or sort_index(), and specifying more than one column, the results are not sorted correctly.
In the above code, the output is:
[...] key function. The results are first sorted on the column let, then to break ties, sorted on the column num.
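For reference, the hierarchical sort described here (first on let, ties broken on num) can be sketched as follows. The frame below is hypothetical, since the original reproducible example did not survive, and the identity key is used only to show that a key which preserves each column's order leaves the result unchanged.

```python
import pandas as pd

# Hypothetical frame; only the column names come from the report.
df = pd.DataFrame({"let": ["b", "a", "b", "a"], "num": [1, 2, 0, 1]})

# Hierarchical sort: by "let" first, then "num" within equal "let" values.
plain = df.sort_values(by=["let", "num"])

# An identity key returns each column unchanged, so the order is the same.
with_key = df.sort_values(by=["let", "num"], key=lambda col: col)

print(plain)
print(with_key.equals(plain))
```

A key that changes the relative order within a column, such as one that ranks the values of the whole column, would change the result, because the sort then runs on the transformed values rather than the originals.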
Expected Behavior
The result of the sort with a key argument in this case should be the same as without the function. When specifying the key argument with more than one column, the result should be hierarchically sorted.
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.10.14
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.26100
machine : AMD64
processor : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 2.2.3
numpy : 2.2.1
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : 5.3.0
matplotlib : 3.8.4
numba : None
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : 1.2.8
pytest : N/A
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.15.0
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None