sem_join does not handle cases where there are white-spaces or NULL values. #71

palak-463 · 2025-01-08T11:29:53Z

Describe the bug
The current implementation of the sem_join operator does not handle whitespace-only or empty string values in the dataset. These values are processed as valid data and included in the final join results, leading to incorrect and unexpected outputs. The issue arises when rows contain only spaces or are entirely empty, and different incorrect results are generated on every run. Although the LLM also plays a role in this, it would be great if there were a way to clean the data, since cleaning large datasets could take some time.

Expected behavior
Data entries whose values are not semantically related to the other values in the data frame, or to the column name, should be cleared out to avoid LLM hallucination when the output is generated. Skipping those entries could also avoid the issue, but it would further slow down the computation if a large database contains too many irrelevant entries.

Describe the solution you'd like
A new semantic operator could be proposed that cleans the redundant data and thereby produces better results. This operator could be applied together with any of the other semantic operators.

Describe alternatives you've considered
I tried the sem_dedup and sem_sim_join operators, but neither produced better results. The only way to avoid this issue is to use a clean dataset, which is not the most feasible option: the dataset could be large, and the cleaning process could take a significant amount of time.
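For reference, the "clean dataset" workaround could be sketched in plain pandas as below. This is only a minimal illustration, not part of the library: the helper name `drop_blank_rows` and the column name `Course Name` are hypothetical, and it only removes NULL, empty, and whitespace-only values. It does not catch short junk values that still contain a visible character.

```python
import pandas as pd


def drop_blank_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df without rows whose `column` is NULL, empty, or whitespace-only.

    Hypothetical helper for illustration; not part of any library API.
    """
    # Cast to pandas' nullable string dtype so None/NaN stay missing (pd.NA)
    # instead of turning into the literal text "nan", then strip whitespace.
    stripped = df[column].astype("string").str.strip()
    # Treat missing values as empty, then keep only rows with remaining text.
    keep = stripped.fillna("").ne("")
    # The original (unstripped) values are preserved in the returned frame.
    return df.loc[keep].reset_index(drop=True)


# Example with made-up data; "Course Name" is an assumed column name.
courses_data = pd.DataFrame(
    {"Course Name": ["Linear Algebra", "   ", "", None, "n  "]}
)
cleaned_courses = drop_blank_rows(courses_data, "Course Name")
print(cleaned_courses)
```

The filter is vectorized, so it runs in a single pass over the column. The cleaned frames would then be passed to sem_join in place of the originals. Note that a value such as 'n' followed by spaces survives this filter, so it addresses only the whitespace and NULL part of the problem.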

Additional context and screenshots

'n', along with whitespace, is accepted as a valid data value in both courses_data and skills_data and is therefore part of the output. The LLM hallucinates because of this and produces wrong output for other, valid data entries as well.

Desktop

OS: Windows 10
Browser: Microsoft Edge
Version: 131.0.2903.112

palak-463 added the bug Something isn't working label Jan 8, 2025
