sem_join does not handle whitespace-only or NULL values. #71

Open
palak-463 opened this issue Jan 8, 2025 · 0 comments
Labels
bug Something isn't working

palak-463 commented Jan 8, 2025

Describe the bug
The current implementation of the sem_join operator does not handle whitespace-only or empty string values in the dataset. These values are processed as valid data and included in the final join results, which leads to incorrect and unexpected output. When rows contain only spaces or are entirely empty, the results are wrong and differ from run to run. Although the LLM also plays a role in this, it would be great to have a built-in way to clean the data, since cleaning large datasets separately can take considerable time.
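As a stopgap, the blank rows described above can be filtered out with plain pandas before the join is run. This is only a minimal sketch: the helper name `drop_blank_rows` and the `Course Name` column are assumptions made for illustration, not part of the library or of the dataset in this report.

```python
import pandas as pd


def drop_blank_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df without rows whose `column` is NULL, empty, or whitespace-only."""
    # Use the nullable string dtype so NULLs stay <NA> instead of turning into "nan".
    values = df[column].astype("string").str.strip()
    # Keep a row only if it is non-null and still has characters after stripping.
    keep = values.notna() & (values.str.len() > 0)
    return df.loc[keep.fillna(False)].reset_index(drop=True)


# Hypothetical sample data; the real courses_data is not shown in the issue.
courses = pd.DataFrame({"Course Name": ["Linear Algebra", "   ", "", None, "Databases"]})
clean_courses = drop_blank_rows(courses, "Course Name")
print(clean_courses["Course Name"].tolist())
```

The cleaned frame would then be passed to sem_join in place of the raw one. This only removes blank and NULL entries; it does not judge whether a non-blank value is semantically meaningful.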

Expected behavior
Data entries whose values are not semantically related to the other values in the data frame, or to the column name, should be cleared out to avoid LLM hallucination when the output is generated. Skipping those entries could also avoid the issue, but it would further slow down the computation if a large database contains many irrelevant entries.

Describe the solution you'd like
A new semantic operator could be added that cleans out such redundant data and thereby produces better results. It could be applied alongside any of the other semantic operators.
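To make the proposal more concrete, here is a rough sketch of how a cleaning step could be packaged as a DataFrame accessor, so it can be chained before other operators. The accessor name `clean_sketch` is hypothetical and does not exist in the library. It covers only the rule-based part (NULL, empty, and whitespace-only values), not a semantic check by an LLM.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("clean_sketch")
class CleanSketchAccessor:
    """Hypothetical accessor for illustration; not part of the library in this issue."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def __call__(self, columns: list[str]) -> pd.DataFrame:
        """Drop rows where any of `columns` is NULL, empty, or whitespace-only."""
        keep = pd.Series(True, index=self._df.index)
        for col in columns:
            stripped = self._df[col].astype("string").str.strip()
            valid = stripped.notna() & (stripped.str.len() > 0)
            keep &= valid.fillna(False).astype(bool)
        return self._df.loc[keep].reset_index(drop=True)


# Hypothetical sample data; the real skills_data is not shown in the issue.
skills = pd.DataFrame({"Skill": ["SQL", " ", None, "Proofs"]})
cleaned = skills.clean_sketch(["Skill"])
print(cleaned["Skill"].tolist())
```

Because the accessor returns an ordinary DataFrame, its result could be handed to any other operator that accepts one.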

Describe alternatives you've considered
I tried the sem_dedup and sem_sim_join operators, but they did not produce better results. The only way to avoid the issue at present is to use a clean dataset, which is not always feasible: the dataset could be large, and cleaning it could take a significant amount of time.

Additional context and screenshots
[Screenshot: sem_join]
[Screenshot: output]
'n' padded with whitespace is accepted as a data value in both courses_data and skills_data, and therefore appears in the output. This causes the LLM to hallucinate and produce wrong output for other, valid data entries as well.
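Values like the padded 'n' can be surfaced for inspection before the join with a small diagnostic. The sketch below is a heuristic only: the function name `suspicious_values` and the `min_length` threshold are assumptions, and a short value is not necessarily invalid in every dataset.

```python
import pandas as pd


def suspicious_values(series: pd.Series, min_length: int = 2) -> list[str]:
    """List distinct raw values that are whitespace-padded or shorter than min_length once stripped."""
    raw = series.dropna().astype(str)
    stripped = raw.str.strip()
    # Flag a value if stripping changes it or if too little text remains afterwards.
    flagged = raw[(raw != stripped) | (stripped.str.len() < min_length)]
    return sorted(flagged.unique().tolist())


# Hypothetical sample column; the real skills_data is not shown in the issue.
skills_column = pd.Series(["Python", " n ", "n", "", "SQL", None])
print(suspicious_values(skills_column))
```

The flagged values are only candidates for review; whether to drop them remains a decision for whoever owns the data.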

Desktop

  • OS: Windows 10
  • Browser: Microsoft Edge
  • Version: 131.0.2903.112
palak-463 added the bug label Jan 8, 2025