Describe the bug
The current implementation of the sem_join operator does not handle whitespace-only or empty string values in the dataset. These values are processed as valid data and included in the final join results, leading to incorrect and unexpected outputs. The issue arises when rows contain only spaces or are entirely empty, and different incorrect results are generated on every run. Although the LLM also plays a role in this, it would be great to have a built-in way to clean the data, since cleaning large datasets by hand could cost some time.
Expected behavior
Data entries whose values are not semantically related to the other values in the data frame, or to the column name, should be cleared out to avoid LLM hallucination when output is generated. Skipping those entries could also avoid the issue, but it would further slow down the computation if there are too many irrelevant entries in a large database.
Describe the solution you'd like
A new semantic operator could be added that cleans out the redundant data and thereby produces better results. This operator could be applied together with any of the other semantic operators.
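To illustrate the kind of cleaning step meant here, below is a minimal sketch in plain pandas. It is not an existing operator in the library and makes no LLM calls; the helper name `drop_blank_rows`, the column name, and the sample rows are all hypothetical and only cover the empty and whitespace-only case.

```python
import pandas as pd


def drop_blank_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df without rows whose `column` is missing, empty, or whitespace-only.

    Hypothetical helper for illustration; not part of any library API.
    """
    # Treat missing values as empty strings, then strip surrounding whitespace.
    stripped = df[column].fillna("").astype(str).str.strip()
    # Keep only rows that still contain at least one character.
    return df.loc[stripped != ""].reset_index(drop=True)


# Illustrative data only; the column name is a placeholder.
courses_data = pd.DataFrame(
    {"Course Name": ["Linear Algebra", "   ", "", None, "Databases"]}
)
clean_courses = drop_blank_rows(courses_data, "Course Name")
print(clean_courses)
```

Running a filter like this on both data frames before the join would keep blank rows out of the prompt entirely, which is cheaper than asking the LLM to skip them.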
Describe alternatives you've considered
I tried applying the sem_dedup and sem_sim_join operators, but they did not produce better results. The only way to avoid this issue is to use a clean dataset, which is not always feasible: the dataset could be large, and the cleaning process could cost a significant amount of time.
Additional context and screenshots
'n' together with whitespace is accepted as a data value in both courses_data and skills_data, and is therefore part of the output. The LLM hallucinates because of this and produces wrong output for other valid data entries as well.
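A rough sketch of how such entries could be spotted before the join, assuming the problematic values look like a stray single character padded with spaces. The helper `flag_suspect_values`, the `min_chars` threshold, the column name, and the sample rows are assumptions for illustration, not taken from the real data.

```python
import pandas as pd


def flag_suspect_values(series: pd.Series, min_chars: int = 2) -> pd.Series:
    """Return a boolean mask that is True where the stripped value is shorter than min_chars.

    Hypothetical helper for illustration; the threshold is an arbitrary assumption.
    """
    # Missing values count as empty; surrounding whitespace is ignored.
    stripped = series.fillna("").astype(str).str.strip()
    return stripped.str.len() < min_chars


# Illustrative data only; the column name is a placeholder.
skills_data = pd.DataFrame({"Skill": ["SQL", " n ", "   ", "Math"]})
mask = flag_suspect_values(skills_data["Skill"])
print(skills_data[mask])   # rows that would be reviewed or dropped
```

A length threshold is a blunt heuristic: it would also flag legitimate one-letter values such as "C" or "R", so the flagged rows are better reviewed than dropped automatically.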
Desktop
OS: Windows 10
Browser: Microsoft Edge
Version: 131.0.2903.112