[BUG] DataFrame.to_pandas generates duplicates #382

cbeaujoin-stellar · 2024-03-01T15:07:03Z

What is the bug?

DataFrame.to_pandas generates duplicate rows when os_index_field is set to a value other than "_doc".

How can one reproduce the bug?

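        import opensearch_py_ml as oml

        # client, index and columns are defined elsewhere
        opensearch_df = oml.DataFrame(client, index, columns=columns, os_index_field="@timestamp")
        index_df = opensearch_df.to_pandas(show_progress=True)
        # Select every row that has at least one exact duplicate
        dup = index_df[index_df.duplicated(keep=False)]
        print(len(dup))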

=> Loading index:
2024-03-01 16:02:58.179774: read 10000 rows
2024-03-01 16:03:07.520786: read 14895 rows
4930
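
A note on reading the count above: `duplicated(keep=False)` flags every member of a duplicate group, so 4930 is the number of rows involved in a duplicate, not the number of extra copies. pandas also compares column values only and ignores the index, so distinct documents with identical column values would be counted as well. A minimal sketch on a toy frame (the column names and values here are made up for illustration, not taken from the reporter's index):

```python
import pandas as pd

# Toy frame standing in for the result of to_pandas(); rows "a" and "c"
# carry identical column values but different index labels.
df = pd.DataFrame(
    {"carrier": ["X", "Y", "X", "Z"], "delay": [5, 7, 5, 1]},
    index=["a", "b", "c", "d"],
)

# keep=False marks every member of a duplicate group.
all_copies = df[df.duplicated(keep=False)]
# The default keep="first" marks only the repeats after the first occurrence.
extra_copies = df[df.duplicated()]
# Moving the index into a column makes it part of the comparison.
with_index = int(df.reset_index().duplicated(keep=False).sum())

print(len(all_copies), len(extra_copies), with_index)
```

If the count stays non-zero after `reset_index()`, the same document (same index value and same columns) was returned more than once, which is the behavior this issue describes.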

What is the expected behavior?
opensearch_py_ml/operations.py:1229

    def to_pandas(
        self, query_compiler: "QueryCompiler", show_progress: bool = False
    ) -> pd.DataFrame:
...
        for df in self.search_yield_pandas_dataframes(query_compiler=query_compiler):

search_yield_pandas_dataframes should be called with the sort_index parameter set to the os_index_field value defined in the oml.DataFrame.

What is your host/environment?

OS: Linux / Windows (same behavior)


dhrubo-os · 2024-03-07T01:13:35Z

Hi @cbeaujoin-stellar, thanks for creating the issue. Please feel free to raise a PR if you want.

cbeaujoin-stellar · 2024-06-05T13:31:15Z

Any update?

Yerzhaisang · 2024-11-03T09:04:41Z

Hi Dhrubo,

I attempted to reproduce the issue on OpenSearch 2.7 using the sample data, iterating over all columns and setting os_index_field for each one. However, after performing these checks, I didn’t encounter any duplicate rows.

Could you please review this and see if there’s anything I might have overlooked?

Thanks!
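
For anyone repeating this check, the per-field loop described above could be sketched roughly as follows. The helper name `duplicate_counts` and the `load_frame` callable are hypothetical; in a real run `load_frame` would wrap something like `oml.DataFrame(client, index, os_index_field=field).to_pandas()`, while the stand-in loader below only exists so the sketch runs without a cluster.

```python
from typing import Callable, Dict, Iterable

import pandas as pd


def duplicate_counts(
    load_frame: Callable[[str], pd.DataFrame], fields: Iterable[str]
) -> Dict[str, int]:
    """Return, per index field, how many rows are involved in a duplicate."""
    counts = {}
    for field in fields:
        frame = load_frame(field)
        # Include the index in the comparison so only fully repeated rows count.
        counts[field] = int(frame.reset_index().duplicated(keep=False).sum())
    return counts


# Hypothetical stand-in loader: the "bad" field returns one row twice.
def fake_load(field: str) -> pd.DataFrame:
    rows = {"good": [1, 2, 3], "bad": [1, 2, 2]}[field]
    return pd.DataFrame({"value": rows}, index=pd.Index(rows, name=field))


print(duplicate_counts(fake_load, ["good", "bad"]))
```

Any field that reports a non-zero count is a candidate for reproducing the bug.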

Yerzhaisang · 2024-11-04T18:25:01Z

Hey @cbeaujoin-stellar, I hope you’re doing well.

Could you please take a look at this and try to reproduce the issue using some sample flight data?

Thanks!

cbeaujoin-stellar added the bug (Something isn't working) and untriaged labels Mar 1, 2024

Arnav-Gr0ver mentioned this issue Jun 6, 2024

added sort_index parameter to DataFrame.to_pandas function #392

Closed

5 tasks

dhrubo-os removed the untriaged label Jun 12, 2024
