[BUG] DataFrame.to_pandas generates duplicates #382

Open
cbeaujoin-stellar opened this issue Mar 1, 2024 · 4 comments
Labels
bug Something isn't working

Comments

cbeaujoin-stellar commented Mar 1, 2024

What is the bug?

DataFrame.to_pandas returns duplicate rows when os_index_field is set to something other than "_doc".

How can one reproduce the bug?

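    import opensearch_py_ml as oml

    # client, index and columns come from the reporter's environment (not shown)
    opensearch_df = oml.DataFrame(client, index, columns=columns, os_index_field="@timestamp")
    index_df = opensearch_df.to_pandas(True)  # True enables show_progress
    # keep=False flags every row that has an identical twin
    dup = index_df[index_df.duplicated(keep=False)]
    print(len(dup))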

=> Loading index:
2024-03-01 16:02:58.179774: read 10000 rows
2024-03-01 16:03:07.520786: read 14895 rows
4930
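
A note on reading that count: `duplicated(keep=False)` flags every member of a duplicate group, not just the surplus copies, and it compares column values only, so the pandas index (here `@timestamp`) is ignored. The sketch below uses a small made-up frame, not data from the affected index, to show how the three ways of counting differ. It may help separate rows that are truly returned twice from distinct documents that merely share column values.

```python
import pandas as pd

# Toy frame standing in for index_df; all values are invented for illustration.
index_df = pd.DataFrame(
    {"host": ["a", "b", "a", "c", "a"], "status": [200, 404, 200, 500, 200]},
    index=pd.Index(["t1", "t2", "t1", "t3", "t4"], name="@timestamp"),
)

# keep=False marks every row that has at least one identical twin (columns only).
all_members = index_df[index_df.duplicated(keep=False)]

# keep="first" marks only the surplus copies after the first occurrence.
surplus = index_df[index_df.duplicated(keep="first")]

# Moving the index into a column makes it part of the comparison, so rows with
# equal column values but different index values are no longer counted.
with_index = index_df.reset_index()
surplus_with_index = with_index[with_index.duplicated(keep="first")]

print(len(all_members), len(surplus), len(surplus_with_index))  # 3 2 1
```

If the count stays high after the index is included in the comparison, the same documents are most likely being returned more than once.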

What is the expected behavior?
opensearch_py_ml/operations.py:1229

    def to_pandas(
        self, query_compiler: "QueryCompiler", show_progress: bool = False
    ) -> pd.DataFrame:
...
        for df in self.search_yield_pandas_dataframes(query_compiler=query_compiler):

search_yield_pandas_dataframes should be called with its sort_index parameter set to the os_index_field value defined on the oml.DataFrame.

What is your host/environment?

  • OS: Linux / Windows (same behavior)
cbeaujoin-stellar added the bug and untriaged labels Mar 1, 2024
@dhrubo-os
Collaborator

Hi @cbeaujoin-stellar, thanks for creating the issue. Please feel free to raise a PR if you want.

@cbeaujoin-stellar
Author

Any update?

@Yerzhaisang
Contributor

Hi Dhrubo,

I attempted to reproduce the issue on OpenSearch 2.7 using the sample data, iterating over all columns and setting os_index_field for each one. However, after performing these checks, I didn’t encounter any duplicate rows.
[Image: oml_repr]

Could you please review this and see if there’s anything I might have overlooked?

Thanks!

@Yerzhaisang
Contributor

Hey @cbeaujoin-stellar, I hope you’re doing well.

Could you please take a look at this and try to reproduce the issue using some sample flight data?

Thanks!
