
Fix usage of whitespace tokenizer #410

Merged: 2 commits from tokenizer-lowercase into master on Jan 29, 2024
Conversation

Jade-GG
Collaborator

@Jade-GG Jade-GG commented Jan 25, 2024

This makes synonym matching a little more reliable, since the whitespace tokenizer made synonyms case sensitive.

We could also use the standard tokenizer; it shouldn't really make a difference here.

Tokenizer reference

@indykoning
Member

Nice! Just to be sure, lowercase is an extension of whitespace, right? So no breaking changes are made.

@Jade-GG
Collaborator Author

Jade-GG commented Jan 29, 2024

> Nice! Just to be sure, lowercase is an extension of whitespace, right? So no breaking changes are made.

Not exactly, my original usage of the whitespace tokenizer was actually a breaking change:

  • whitespace only splits queries on whitespace characters, e.g. aBc.dEf 4 gHi is turned into ['aBc.dEf', '4', 'gHi']
  • lowercase splits on any non-letter character and lowercases each token, turning the same string into ['abc', 'def', 'ghi']

I now realize this means every number in the search query gets removed... It might be better to just stick with the standard tokenizer, even though it doesn't make the query case insensitive; at least that would fix the accidental breaking change I made.
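To make the difference concrete, here's a rough sketch that approximates each tokenizer's splitting behavior with regexes. These are simplifications, not Elasticsearch's actual implementations (the real standard tokenizer uses Unicode text segmentation), but they match the examples above for simple ASCII input:

```python
import re

def whitespace_tokenize(text):
    # whitespace tokenizer: splits only on whitespace,
    # preserving case and punctuation.
    return text.split()

def lowercase_tokenize(text):
    # lowercase tokenizer: splits on any non-letter character
    # and lowercases each token, so digits are dropped entirely.
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def standard_tokenize(text):
    # Rough approximation of the standard tokenizer: splits on
    # non-alphanumeric characters, keeping digits and original casing.
    return [t for t in re.split(r"[^a-zA-Z0-9]+", text) if t]

query = "aBc.dEf 4 gHi"
print(whitespace_tokenize(query))  # ['aBc.dEf', '4', 'gHi']
print(lowercase_tokenize(query))   # ['abc', 'def', 'ghi']
print(standard_tokenize(query))    # ['aBc', 'dEf', '4', 'gHi']
```

Note how only the lowercase variant loses the '4', which is the accidental breaking change discussed here.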

@Jade-GG Jade-GG changed the title Use lowercase tokenizer Fix usage of whitespace tokenizer Jan 29, 2024
@indykoning
Member

indykoning commented Jan 29, 2024

That's a shame. Might it be a good idea to make this configurable? Or does standard already cover most, if not all, of the use cases we've come across?

@Jade-GG
Collaborator Author

Jade-GG commented Jan 29, 2024

> That's a shame. Might it be a good idea to make this configurable? Or does standard already cover most, if not all, of the use cases we've come across?

It's already somewhat configurable if you use the eventy filters and set the mappings/settings manually.

We could also make it configurable globally, but then we'd have to figure out a way to apply it to every mapping, rather than just the ones that get the synonym filter applied (which is where this tokenizer is actually used). Could be an interesting story, but probably for another time 😅
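For anyone overriding the mappings/settings manually, a hedged sketch of what such an index-settings fragment could look like (the analyzer and filter names here are illustrative, not necessarily what this project uses). Case insensitivity is usually handled by the lowercase token filter rather than the tokenizer, so combining the standard tokenizer with a lowercase filter before the synonym filter keeps numbers in the query while still matching synonyms case-insensitively:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym_graph",
          "synonyms": ["tv, television"]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym"]
        }
      }
    }
  }
}
```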

Member

@indykoning indykoning left a comment


If they're accessible in the eventy filters, that will be enough for now if someone wants to override this setting.
If it comes up often in the future, we can always make that easier.

@indykoning indykoning merged commit 431f80e into master Jan 29, 2024
25 of 26 checks passed
@Jade-GG Jade-GG deleted the tokenizer-lowercase branch February 7, 2024 14:26