Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.sort_values() by 2 columns and a key function produces incorrect results #60673

Open
3 tasks done
Dr-Irv opened this issue Jan 7, 2025 · 3 comments
Open
3 tasks done
Labels
Docs Sorting e.g. sort_index, sort_values

Comments

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 7, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame.from_records(
    [[let, num] for let in "DCBA" for num in [2, 1]], columns=["let", "num"]
)
print(df)

r1 = df.sort_values(["let", "num"])
print(r1)


def key_func(s: pd.Series) -> pd.Series:
    result = s.sort_values()
    return result


r2 = df.sort_values(["let", "num"], key=key_func)
print(r2)

Issue Description

When providing a key argument to sort_values() or sort_index(), and specifying more than one column, the results are not sorted correctly.

In the above code, the output is:

  let  num
0   D    2
1   D    1
2   C    2
3   C    1
4   B    2
5   B    1
6   A    2
7   A    1
  let  num
7   A    1
6   A    2
5   B    1
4   B    2
3   C    1
2   C    2
1   D    1
0   D    2
  let  num
0   D    2
1   D    1
2   C    2
3   C    1
4   B    2
5   B    1
6   A    2
7   A    1
  • The first DF is the original DF
  • The second DF is the sorted DF without a key function. The results are first sorted on the column let, then to break ties, sorted on the column num
  • The third DF is the result of using a key function that sorts each column (based on the specification in the API - the function has to return a sorted column). The result is a DF that is not sorted. It should be the same as the second DF

Expected Behavior

The result of the sort with a key argument in this case should be the same as without the function. When specifying the key argument with more than one column, the result should be hierarchically sorted.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.14
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.26100
machine : AMD64
processor : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.2.3
numpy : 2.2.1
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : 5.3.0
matplotlib : 3.8.4
numba : None
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : 1.2.8
pytest : N/A
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.15.0
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

@Dr-Irv Dr-Irv added Bug Needs Triage Issue that has not been reviewed by a pandas team member Sorting e.g. sort_index, sort_values labels Jan 7, 2025
@rhshadrach
Copy link
Member

I don't think this is a bug.

(based on the specification in the API - the function has to return a sorted column)

When you say the function, are you referring to sort_values or key?

In this example, you start with the Series D D C C B B A A and your key function returns A A B B C C D D. It is these latter values that pandas sorts, and where it moves these values is where your original values also move. Since key returns something that is already sorted, nothing moves, and so you get your original Series back.

Let's say the key function instead returned A B A B C C D D. Then in sorting this array, indices 1 and 2 switch. Thus the return from sort_values will be D C D C B B A A - i.e. indices 1 and 2 are switched.

This is similar to how Python's key works.

data = list("DCBA")

def key(val):
    # Equivalent to returning "ABCD" in pandas
    return {"D": "A", "C": "B", "B": "C", "A": "D"}[val]

print(sorted(data, key=key))
# ['D', 'C', 'B', 'A']

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jan 7, 2025

@rhshadrach OK, I see your point and now understand how it is supposed to work. I think the docs aren't necessarily clear. For the key argument, it says:

Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.

Maybe it should say:

Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently. The values in the returned Series will be used as the keys for sorting.

@henrytseng
Copy link

henrytseng commented Jan 7, 2025

Yep, answer above there is correct the key func is really a mapping function rather than the function executed hence the result you receive.

As stated, it looks like its executed column wise.

The key function you'd be looking for would be

def key_func(s: pd.Series) -> pd.Series:
    return s.values

If you're looking to perform some kind of special sorting operation it might be useful to create a third column in this case and apply some kind of hashing function

r3['sort_key'] = r3.apply(lambda n: f"{n['let'].lower()}.{n['num']}", axis=1)
r3.sort_values('sort_key').drop('sort_key', axis=1)

@rhshadrach rhshadrach added Docs and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Sorting e.g. sort_index, sort_values
Projects
None yet
Development

No branches or pull requests

3 participants