
Fix usage of whitespace tokenizer #410

Merged: 2 commits from tokenizer-lowercase into master on Jan 29, 2024
Conversation

Jade-GG
Collaborator

@Jade-GG Jade-GG commented Jan 25, 2024

This makes synonym matching a little more reliable, since the whitespace tokenizer made synonyms case sensitive.

We could also use the standard tokenizer; it shouldn't really make a difference here.

Tokenizer reference

@indykoning
Member

Nice! Just to be sure, lowercase is an extension of whitespace, right? So no breaking changes are made.

@Jade-GG
Collaborator Author

Jade-GG commented Jan 29, 2024

> Nice! Just to be sure, lowercase is an extension of whitespace, right? So no breaking changes are made.

Not exactly, my original usage of the whitespace tokenizer was actually a breaking change:

  • whitespace only splits queries on whitespace characters, e.g. aBc.dEf 4 gHi is turned into ['aBc.dEf', '4', 'gHi']
  • lowercase splits on any non-letter character and lowercases each token, turning the same string into ['abc', 'def', 'ghi']

I now realize this means every number in the search query gets removed... It might be better to just stick with the standard tokenizer, even though it doesn't make the query case insensitive; at least that would fix the accidental breaking change I made.
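To make the difference concrete, here's a rough sketch that approximates each tokenizer's splitting behavior with regexes. These are simplifications, not Elasticsearch's actual implementations (the real standard tokenizer uses Unicode text segmentation), but they match the examples above for simple ASCII input:

```python
import re

def whitespace_tokenize(text):
    # whitespace tokenizer: splits only on whitespace,
    # preserving case and punctuation.
    return text.split()

def lowercase_tokenize(text):
    # lowercase tokenizer: splits on any non-letter character
    # and lowercases each token, so digits are dropped entirely.
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def standard_tokenize(text):
    # Rough approximation of the standard tokenizer: splits on
    # non-alphanumeric characters, keeping digits and original casing.
    return [t for t in re.split(r"[^a-zA-Z0-9]+", text) if t]

query = "aBc.dEf 4 gHi"
print(whitespace_tokenize(query))  # ['aBc.dEf', '4', 'gHi']
print(lowercase_tokenize(query))   # ['abc', 'def', 'ghi']
print(standard_tokenize(query))    # ['aBc', 'dEf', '4', 'gHi']
```

Note how only the lowercase variant loses the '4', which is the accidental breaking change discussed here.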

@Jade-GG Jade-GG changed the title Use lowercase tokenizer Fix usage of whitespace tokenizer Jan 29, 2024
@indykoning
Member

indykoning commented Jan 29, 2024

That's a shame. Might it be a good idea to make this configurable? Or does standard already cover most, if not all, of the use cases we've come across?

@Jade-GG
Collaborator Author

Jade-GG commented Jan 29, 2024

> That's a shame. Might it be a good idea to make this configurable? Or does standard already cover most, if not all, of the use cases we've come across?

It's already somewhat configurable if you use the eventy filters and set the mappings/settings manually.

We could also make it configurable globally, but then we'd have to figure out a way to apply it to every mapping, rather than just the ones that get the synonym filter applied (which is where this tokenizer is actually used). Could be an interesting story, but probably for another time 😅
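For anyone overriding the mappings/settings manually, a hedged sketch of what such an index-settings fragment could look like (the analyzer and filter names here are illustrative, not necessarily what this project uses). Case insensitivity is usually handled by the lowercase token filter rather than the tokenizer, so combining the standard tokenizer with a lowercase filter before the synonym filter keeps numbers in the query while still matching synonyms case-insensitively:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym_graph",
          "synonyms": ["tv, television"]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym"]
        }
      }
    }
  }
}
```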

Member

@indykoning indykoning left a comment


If they're accessible in the eventy filters, that will be enough for now if someone wants to override this setting.
If it comes up often in the future, we can always make that easier.

@indykoning indykoning merged commit 431f80e into master Jan 29, 2024
25 of 26 checks passed
@Jade-GG Jade-GG deleted the tokenizer-lowercase branch February 7, 2024 14:26