---
layout: default
title: ASCII folding
parent: Token filters
nav_order: 20
---

# ASCII folding token filter

The `asciifolding` token filter converts non-ASCII characters to their closest ASCII equivalents. For example, *é* becomes *e*, *ü* becomes *u*, and *ñ* becomes *n*. This process is also known as *transliteration*.
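
As a quick, index-free illustration, you can apply the built-in `asciifolding` filter directly in an `_analyze` request. This is a minimal sketch; the sample text is chosen only to show the folding:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "asciifolding" ],
  "text": "crème brûlée"
}
```

The request should return the tokens `creme` and `brulee`.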

The `asciifolding` token filter offers a number of benefits:

- __Enhanced search flexibility__: Users often omit accents or special characters when typing queries. The `asciifolding` token filter ensures that such queries still return relevant results.
- __Normalization__: Standardizes the indexing process by ensuring that accented characters are consistently converted to their ASCII equivalents.
- __Internationalization__: Particularly useful for applications that handle multiple languages and character sets.

*Loss of information*: While ASCII folding can simplify searches, it can also lead to a loss of specific information, particularly if the distinction between accented and non-accented characters is significant in the dataset.
{: .warning}

## Parameters

You can configure the `asciifolding` token filter using the `preserve_original` parameter. Setting this parameter to `true` keeps both the original token and its ASCII-folded version in the token stream. This is particularly useful when you want to match both the original (with accents) and the normalized (without accents) versions of a term in search queries. Default is `false`.
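
To see the effect of `preserve_original` without creating an index, you can define the filter inline in an `_analyze` request. The following is a minimal sketch; the sample text is illustrative only:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "asciifolding",
      "preserve_original": true
    }
  ],
  "text": "café"
}
```

With `preserve_original` set to `true`, the response contains both `cafe` and `café` at the same position. With the default of `false`, only `cafe` is returned.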

## Example

The following example request creates a new index named `example_index` and defines an analyzer with the `asciifolding` filter and the `preserve_original` parameter set to `true`:

```json
PUT /example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "custom_ascii_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_ascii_folding"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /example_index/_analyze
{
  "analyzer": "custom_ascii_analyzer",
  "text": "Résumé café naïve coördinate"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:
```json | ||
{ | ||
"tokens": [ | ||
{ | ||
"token": "resume", | ||
"start_offset": 0, | ||
"end_offset": 6, | ||
"type": "<ALPHANUM>", | ||
"position": 0 | ||
}, | ||
{ | ||
"token": "résumé", | ||
"start_offset": 0, | ||
"end_offset": 6, | ||
"type": "<ALPHANUM>", | ||
"position": 0 | ||
}, | ||
{ | ||
"token": "cafe", | ||
"start_offset": 7, | ||
"end_offset": 11, | ||
"type": "<ALPHANUM>", | ||
"position": 1 | ||
}, | ||
{ | ||
"token": "café", | ||
"start_offset": 7, | ||
"end_offset": 11, | ||
"type": "<ALPHANUM>", | ||
"position": 1 | ||
}, | ||
{ | ||
"token": "naive", | ||
"start_offset": 12, | ||
"end_offset": 17, | ||
"type": "<ALPHANUM>", | ||
"position": 2 | ||
}, | ||
{ | ||
"token": "naïve", | ||
"start_offset": 12, | ||
"end_offset": 17, | ||
"type": "<ALPHANUM>", | ||
"position": 2 | ||
}, | ||
{ | ||
"token": "coordinate", | ||
"start_offset": 18, | ||
"end_offset": 28, | ||
"type": "<ALPHANUM>", | ||
"position": 3 | ||
}, | ||
{ | ||
"token": "coördinate", | ||
"start_offset": 18, | ||
"end_offset": 28, | ||
"type": "<ALPHANUM>", | ||
"position": 3 | ||
} | ||
] | ||
} | ||
``` | ||
|
||
|
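
Because the folded and original tokens are indexed at the same position, queries with or without accents match the same documents. The following sketch applies the analyzer to a hypothetical `title` field, indexes a sample document, and searches for the unaccented term; the field name and document are assumptions for illustration only:

```json
PUT /example_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "custom_ascii_analyzer"
    }
  }
}

PUT /example_index/_doc/1?refresh=true
{
  "title": "Résumé café"
}

GET /example_index/_search
{
  "query": {
    "match": {
      "title": "resume"
    }
  }
}
```

Searching for either `resume` or `résumé` should return the indexed document because the match query analyzes the query text with the same analyzer at search time.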