
Add Ascii folding token filter #7912

Merged
2 changes: 1 addition & 1 deletion _analyzers/token-filters/apostrophe.md
@@ -2,7 +2,7 @@
layout: default
title: Apostrophe
parent: Token filters
nav_order: 110
nav_order: 10
---

# Apostrophe token filter
135 changes: 135 additions & 0 deletions _analyzers/token-filters/asciifolding.md
@@ -0,0 +1,135 @@
---
layout: default
title: ASCII folding
parent: Token filters
nav_order: 20
---

# ASCII folding token filter



The `asciifolding` token filter converts non-ASCII characters to their closest ASCII equivalents. For example, *é* becomes *e*, *ü* becomes *u*, and *ñ* becomes *n*. This process is also known as *transliteration*.

The `asciifolding` token filter offers a number of benefits:

- __Enhanced search flexibility__: Users often omit accents or special characters when typing queries. ASCII folding ensures that such queries still return relevant results.
- __Normalization__: Standardizes the indexing process by ensuring that accented characters are consistently converted to their ASCII equivalents.
- __Internationalization__: Particularly useful for applications dealing with multiple languages and character sets.

*Loss of information*: While ASCII folding can simplify searches, it might also lead to a loss of specific information, particularly if the distinction between accented and unaccented characters is significant in the dataset.
{: .warning}

## Parameters

You can configure the `asciifolding` token filter using the `preserve_original` parameter. Setting this parameter to `true` keeps both the original token and its ASCII-folded version in the token stream. This is particularly useful when you want search queries to match both the original (accented) and the normalized (unaccented) versions of a term. Default is `false`.

## Example

The following example request creates a new index named `example_index` and defines an analyzer with the `asciifolding` filter and the `preserve_original` parameter set to `true`:

```json
PUT /example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "custom_ascii_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_ascii_folding"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated by the analyzer:

```json
POST /example_index/_analyze
{
"analyzer": "custom_ascii_analyzer",
"text": "Résumé café naïve coördinate"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "resume",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "résumé",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cafe",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "café",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "naive",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "naïve",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "coordinate",
      "start_offset": 18,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "coördinate",
      "start_offset": 18,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```
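Because `preserve_original` is `true`, both the folded and the original tokens are indexed, so a query for either the accented or the unaccented form of a term can match. As an illustrative sketch (the `title` field and the indexed document are assumptions for this example, not part of the index definition above), a search might look like the following:

```json
POST /example_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "resume",
        "analyzer": "custom_ascii_analyzer"
      }
    }
  }
}
```

Under these assumptions, a document containing *Résumé* would be indexed as both `résumé` and `resume`, so this query would match it, as would a query for *résumé*.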


2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -14,7 +14,7 @@ The following table lists all token filters that OpenSearch supports.

Token filter | Underlying Lucene token filter| Description
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.