From a7d1badaf0e58f651be51e4987bc1410055f379f Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Wed, 7 Aug 2024 17:08:48 +0100 Subject: [PATCH 1/9] adding common_gram token filter page #7923 Signed-off-by: AntonEliatra --- _analyzers/token-filters/common_gram.md | 165 ++++++++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 166 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/common_gram.md diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md new file mode 100644 index 0000000000..9d4a167511 --- /dev/null +++ b/_analyzers/token-filters/common_gram.md @@ -0,0 +1,165 @@ +--- +layout: default +title: common_grams +parent: Token filters +nav_order: 60 +--- + +# Common_grams token filter + +The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. + +Using this token filter improves search relevance by keeping common phrases intact, it can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. + +Using this filter requires careful selection and maintenance of the list of common words +{: .warning} + +## Parameters + +`common_grams` token filter can be configured with several parameters to control its behavior. + +Parameter | Description | Example +`common_words` | A list of words that should be considered as common words. These words will be used to form common grams. (Required) | ["the", "and", "of"] +`ignore_case` | Indicates whether the filter should ignore case differences when matching common words. 
| `true` or `false` (Default: `false`) +`query_mode` | When set to true, the filter only emits common grams during the analysis phase (useful during query time to ensure the query matches documents analyzed with the same filter). | `true` or `false` (Default: `false`) + + +## Example + +The following example request creates a new index named `my_common_grams_index` and configures an analyzer with the `common_grams` filter: + +```json +PUT /my_common_grams_index +{ + "settings": { + "analysis": { + "filter": { + "my_common_grams_filter": { + "type": "common_grams", + "common_words": ["a", "in", "for"], + "ignore_case": true, + "query_mode": true + } + }, + "analyzer": { + "my_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "my_common_grams_filter" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the created analyzer: + +```json +GET /my_common_grams_index/_analyze +{ + "analyzer": "my_analyzer", + "text": "A quick black cat jumps over the lazy dog in the park" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "a_quick", + "start_offset": 0, + "end_offset": 7, + "type": "gram", + "position": 0 + }, + { + "token": "quick", + "start_offset": 2, + "end_offset": 7, + "type": "", + "position": 1 + }, + { + "token": "black", + "start_offset": 8, + "end_offset": 13, + "type": "", + "position": 2 + }, + { + "token": "cat", + "start_offset": 14, + "end_offset": 17, + "type": "", + "position": 3 + }, + { + "token": "jumps", + "start_offset": 18, + "end_offset": 23, + "type": "", + "position": 4 + }, + { + "token": "over", + "start_offset": 24, + "end_offset": 28, + "type": "", + "position": 5 + }, + { + "token": "the", + "start_offset": 29, + "end_offset": 32, + "type": "", + "position": 6 + }, + { + "token": "lazy", + "start_offset": 33, + "end_offset": 37, + 
"type": "", + "position": 7 + }, + { + "token": "dog_in", + "start_offset": 38, + "end_offset": 44, + "type": "gram", + "position": 8 + }, + { + "token": "in_the", + "start_offset": 42, + "end_offset": 48, + "type": "gram", + "position": 9 + }, + { + "token": "the", + "start_offset": 45, + "end_offset": 48, + "type": "", + "position": 10 + }, + { + "token": "park", + "start_offset": 49, + "end_offset": 53, + "type": "", + "position": 11 + } + ] +} +``` + diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index f4e9c434e7..91bbb476e7 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -18,7 +18,7 @@ Token filter | Underlying Lucene token filter| Description `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. `cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. -`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. +[`common_grams`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/common_gram/) | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. `decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). `delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. 
From 0b0a0c690a8cbecc4b388fe2d832ec0cb49e84c0 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 12 Sep 2024 11:08:55 +0100 Subject: [PATCH 2/9] Update common_gram.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/common_gram.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index 9d4a167511..6395bfce3b 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -4,9 +4,9 @@ title: common_grams parent: Token filters nav_order: 60 --- - + # Common_grams token filter - + The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. Using this token filter improves search relevance by keeping common phrases intact, it can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. 
@@ -59,7 +59,7 @@ PUT /my_common_grams_index ## Generated tokens -Use the following request to examine the tokens generated using the created analyzer: +Use the following request to examine the tokens generated using the analyzer: ```json GET /my_common_grams_index/_analyze From b87c53eba22f294754dd4fbd43be33f4c758aca9 Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Thu, 10 Oct 2024 16:18:15 +0100 Subject: [PATCH 3/9] addressing the PR comments Signed-off-by: Anton Rubin --- _analyzers/token-filters/common_gram.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index 6395bfce3b..7e3e0a9d9b 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -7,7 +7,7 @@ nav_order: 60 # Common_grams token filter -The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. +The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. If only common words are present in the input string, this token filter generates both unigrams and bigrams. Using this token filter improves search relevance by keeping common phrases intact, it can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. 
@@ -16,12 +16,16 @@ Using this filter requires careful selection and maintenance of the list of comm ## Parameters -`common_grams` token filter can be configured with several parameters to control its behavior. +The `common_grams` token filter can be configured with the following parameters: -Parameter | Description | Example -`common_words` | A list of words that should be considered as common words. These words will be used to form common grams. (Required) | ["the", "and", "of"] -`ignore_case` | Indicates whether the filter should ignore case differences when matching common words. | `true` or `false` (Default: `false`) -`query_mode` | When set to true, the filter only emits common grams during the analysis phase (useful during query time to ensure the query matches documents analyzed with the same filter). | `true` or `false` (Default: `false`) +- `common_words`: A list of words that should be considered as common words. These words will be used to form common grams. If the `common_words` parameter is given an empty list, the `common_grams` token filter becomes a pass-through filter, meaning it doesn't modify the input tokens at all. (List of strings, _Required_) +- `ignore_case`: Indicates whether the filter should ignore case differences when matching common words. Default is `false`. (Boolean, _Optional_) +- `query_mode`: When set to `true` the following rules are applied: + - unigrams that are common_words are not included in the output. + - bigrams where a non-common word is followed by a common_word are retained in the output. + - unigrams of non-common words are excluded if they're immediately followed by a common_word. + - If a non-common word appears at the end of the text and is preceded by a common word, its unigram is also not included in the output. 
+ Default: `false` (Boolean, _Optional_) ## Example From ed3eaf2b9513cda2ffe3bf4d1a1e2f291239086d Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Thu, 10 Oct 2024 16:22:45 +0100 Subject: [PATCH 4/9] addressing the PR comments Signed-off-by: Anton Rubin --- _analyzers/token-filters/common_gram.md | 96 ++++--------------------- 1 file changed, 12 insertions(+), 84 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index 7e3e0a9d9b..c973265436 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -79,90 +79,18 @@ The response contains the generated tokens: ```json { "tokens": [ - { - "token": "a_quick", - "start_offset": 0, - "end_offset": 7, - "type": "gram", - "position": 0 - }, - { - "token": "quick", - "start_offset": 2, - "end_offset": 7, - "type": "", - "position": 1 - }, - { - "token": "black", - "start_offset": 8, - "end_offset": 13, - "type": "", - "position": 2 - }, - { - "token": "cat", - "start_offset": 14, - "end_offset": 17, - "type": "", - "position": 3 - }, - { - "token": "jumps", - "start_offset": 18, - "end_offset": 23, - "type": "", - "position": 4 - }, - { - "token": "over", - "start_offset": 24, - "end_offset": 28, - "type": "", - "position": 5 - }, - { - "token": "the", - "start_offset": 29, - "end_offset": 32, - "type": "", - "position": 6 - }, - { - "token": "lazy", - "start_offset": 33, - "end_offset": 37, - "type": "", - "position": 7 - }, - { - "token": "dog_in", - "start_offset": 38, - "end_offset": 44, - "type": "gram", - "position": 8 - }, - { - "token": "in_the", - "start_offset": 42, - "end_offset": 48, - "type": "gram", - "position": 9 - }, - { - "token": "the", - "start_offset": 45, - "end_offset": 48, - "type": "", - "position": 10 - }, - { - "token": "park", - "start_offset": 49, - "end_offset": 53, - "type": "", - "position": 11 - } + {"token": "a_quick","start_offset": 0,"end_offset": 7,"type": "gram","position": 0}, + 
{"token": "quick","start_offset": 2,"end_offset": 7,"type": "","position": 1}, + {"token": "black","start_offset": 8,"end_offset": 13,"type": "","position": 2}, + {"token": "cat","start_offset": 14,"end_offset": 17,"type": "","position": 3}, + {"token": "jumps","start_offset": 18,"end_offset": 23,"type": "","position": 4}, + {"token": "over","start_offset": 24,"end_offset": 28,"type": "","position": 5}, + {"token": "the","start_offset": 29,"end_offset": 32,"type": "","position": 6}, + {"token": "lazy","start_offset": 33,"end_offset": 37,"type": "","position": 7}, + {"token": "dog_in","start_offset": 38,"end_offset": 44,"type": "gram","position": 8}, + {"token": "in_the","start_offset": 42,"end_offset": 48,"type": "gram","position": 9}, + {"token": "the","start_offset": 45,"end_offset": 48,"type": "","position": 10}, + {"token": "park","start_offset": 49,"end_offset": 53,"type": "","position": 11} ] } ``` From 96f08b1619a92f2f7ad3cf00f5e21bfff5d881d2 Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Thu, 10 Oct 2024 16:35:02 +0100 Subject: [PATCH 5/9] addressing the PR comments Signed-off-by: Anton Rubin --- _analyzers/token-filters/common_gram.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index c973265436..5b0a949fb2 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -7,7 +7,7 @@ nav_order: 60 # Common_grams token filter -The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. If only common words are present in the input string, this token filter generates both unigrams and bigrams. 
+The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. If any common words are present in the input string, this token filter generates both unigrams and bigrams. Using this token filter improves search relevance by keeping common phrases intact, it can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. From 53c418262c5b0732698f42beb46611819c095689 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Wed, 16 Oct 2024 14:57:45 +0100 Subject: [PATCH 6/9] Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra --- _analyzers/token-filters/common_gram.md | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index 5b0a949fb2..e99119216a 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -1,17 +1,17 @@ --- layout: default -title: common_grams +title: Common grams parent: Token filters nav_order: 60 --- -# Common_grams token filter +# Common grams token filter -The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur and can impact the search relevance if treated as separate tokens. If any common words are present in the input string, this token filter generates both unigrams and bigrams. 
+The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur as a unit and can impact the search relevance if treated as separate tokens. If any common words are present in the input string, this token filter generates both their unigrams and bigrams. -Using this token filter improves search relevance by keeping common phrases intact, it can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. +Using this token filter improves search relevance by keeping common phrases intact. This can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. -Using this filter requires careful selection and maintenance of the list of common words +When using this filter, you must carefully select and maintain of the `common_words` list. {: .warning} ## Parameters @@ -19,13 +19,12 @@ Using this filter requires careful selection and maintenance of the list of comm The `common_grams` token filter can be configured with the following parameters: - `common_words`: A list of words that should be considered as common words. These words will be used to form common grams. If the `common_words` parameter is given an empty list, the `common_grams` token filter becomes a pass-through filter, meaning it doesn't modify the input tokens at all. (List of strings, _Required_) -- `ignore_case`: Indicates whether the filter should ignore case differences when matching common words. Default is `false`. (Boolean, _Optional_) -- `query_mode`: When set to `true` the following rules are applied: - - unigrams that are common_words are not included in the output. 
- - bigrams where a non-common word is followed by a common_word are retained in the output. - - unigrams of non-common words are excluded if they're immediately followed by a common_word. - - If a non-common word appears at the end of the text and is preceded by a common word, its unigram is also not included in the output. - Default: `false` (Boolean, _Optional_) +- `ignore_case` (Boolean, _Optional_): Indicates whether the filter should ignore case differences when matching common words. Default is `false`. +- `query_mode` (Boolean, _Optional_): When set to `true`, the following rules are applied: + - unigrams that are generated from `common_words` are not included in the output. + - bigrams in which a non-common word is followed by common word are retained in the output. + - unigrams of non-common words are excluded if they are immediately followed by a common word. + - If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. ## Example From c265a72249033388930f9d85b473c64e91c915d4 Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Wed, 16 Oct 2024 15:19:04 +0100 Subject: [PATCH 7/9] addressing the PR comments Signed-off-by: Anton Rubin --- _analyzers/token-filters/common_gram.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index e99119216a..0e68c3f91e 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -11,20 +11,18 @@ The `common_grams` token filter in OpenSearch improves search relevance by keepi Using this token filter improves search relevance by keeping common phrases intact. This can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. 
-When using this filter, you must carefully select and maintain of the `common_words` list. +When using this filter, you must carefully select and maintain the `common_words` list. {: .warning} ## Parameters -The `common_grams` token filter can be configured with the following parameters: +The `common_grams` token filter can be configured with the following parameters. -- `common_words`: A list of words that should be considered as common words. These words will be used to form common grams. If the `common_words` parameter is given an empty list, the `common_grams` token filter becomes a pass-through filter, meaning it doesn't modify the input tokens at all. (List of strings, _Required_) -- `ignore_case` (Boolean, _Optional_): Indicates whether the filter should ignore case differences when matching common words. Default is `false`. -- `query_mode` (Boolean, _Optional_): When set to `true`, the following rules are applied: - - unigrams that are generated from `common_words` are not included in the output. - - bigrams in which a non-common word is followed by common word are retained in the output. - - unigrams of non-common words are excluded if they are immediately followed by a common word. - - If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. +Parameter | Data type | Description | Required/Optional +:--- | :--- | :--- | :--- +`common_words` | List of strings | A list of words that should be considered as words appearing together. These words will be used to generate common grams. If the `common_words` parameter is an empty list, the `common_grams` token filter becomes a no-op filter, meaning it doesn't modify the input tokens at all. | Required +`ignore_case` | Boolean | Indicates whether the filter should ignore case differences when matching common words. Default is `false`. | Optional +`query_mode` | Boolean | When set to `true`, the following rules are applied:
- unigrams that are generated from `common_words` are not included in the output.
- bigrams in which a non-common word is followed by common word are retained in the output.
- unigrams of non-common words are excluded if they are immediately followed by a common word.
- If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. | Optional ## Example From 5534e088040fafde8db9b404640a6025e1f1864d Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Wed, 16 Oct 2024 15:31:52 +0100 Subject: [PATCH 8/9] updating parameter table structure Signed-off-by: Anton Rubin --- _analyzers/token-filters/common_gram.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index 0e68c3f91e..459bc60567 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -18,11 +18,11 @@ When using this filter, you must carefully select and maintain the `common_words The `common_grams` token filter can be configured with the following parameters. -Parameter | Data type | Description | Required/Optional +Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- -`common_words` | List of strings | A list of words that should be considered as words appearing together. These words will be used to generate common grams. If the `common_words` parameter is an empty list, the `common_grams` token filter becomes a no-op filter, meaning it doesn't modify the input tokens at all. | Required -`ignore_case` | Boolean | Indicates whether the filter should ignore case differences when matching common words. Default is `false`. | Optional -`query_mode` | Boolean | When set to `true`, the following rules are applied:
- unigrams that are generated from `common_words` are not included in the output.
- bigrams in which a non-common word is followed by common word are retained in the output.
- unigrams of non-common words are excluded if they are immediately followed by a common word.
- If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. | Optional +`common_words` | Required | List of strings | A list of words that should be considered as words appearing together. These words will be used to generate common grams. If the `common_words` parameter is an empty list, the `common_grams` token filter becomes a no-op filter, meaning it doesn't modify the input tokens at all. +`ignore_case` | Optional | Boolean | Indicates whether the filter should ignore case differences when matching common words. Default is `false`. +`query_mode` | Optional | Boolean | When set to `true`, the following rules are applied:
- Unigrams that are generated from `common_words` are not included in the output.
- Bigrams in which a non-common word is followed by common word are retained in the output.
- Unigrams of non-common words are excluded if they are immediately followed by a common word.
- If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. ## Example From a80c82f224d0972c1d7c8049b7edccfb35965030 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Wed, 16 Oct 2024 11:15:12 -0400 Subject: [PATCH 9/9] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/token-filters/common_gram.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_analyzers/token-filters/common_gram.md b/_analyzers/token-filters/common_gram.md index 459bc60567..58f5bbe149 100644 --- a/_analyzers/token-filters/common_gram.md +++ b/_analyzers/token-filters/common_gram.md @@ -7,7 +7,7 @@ nav_order: 60 # Common grams token filter -The `common_grams` token filter in OpenSearch improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets where certain word combinations frequently occur as a unit and can impact the search relevance if treated as separate tokens. If any common words are present in the input string, this token filter generates both their unigrams and bigrams. +The `common_grams` token filter improves search relevance by keeping commonly occurring phrases (common grams) in the text. This is useful when dealing with languages or datasets in which certain word combinations frequently occur as a unit and can impact search relevance if treated as separate tokens. If any common words are present in the input string, this token filter generates both their unigrams and bigrams. Using this token filter improves search relevance by keeping common phrases intact. This can help in matching queries more accurately, particularly for frequent word combinations. It also improves search precision by reducing the number of irrelevant matches. 
@@ -20,9 +20,9 @@ The `common_grams` token filter can be configured with the following parameters. Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- -`common_words` | Required | List of strings | A list of words that should be considered as words appearing together. These words will be used to generate common grams. If the `common_words` parameter is an empty list, the `common_grams` token filter becomes a no-op filter, meaning it doesn't modify the input tokens at all. +`common_words` | Required | List of strings | A list of words that should be treated as words that commonly appear together. These words will be used to generate common grams. If the `common_words` parameter is an empty list, the `common_grams` token filter becomes a no-op filter, meaning that it doesn't modify the input tokens at all. `ignore_case` | Optional | Boolean | Indicates whether the filter should ignore case differences when matching common words. Default is `false`. -`query_mode` | Optional | Boolean | When set to `true`, the following rules are applied:
- Unigrams that are generated from `common_words` are not included in the output.
- Bigrams in which a non-common word is followed by common word are retained in the output.
- Unigrams of non-common words are excluded if they are immediately followed by a common word.
- If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. +`query_mode` | Optional | Boolean | When set to `true`, the following rules are applied:
- Unigrams that are generated from `common_words` are not included in the output.
- Bigrams in which a non-common word is followed by a common word are retained in the output.
- Unigrams of non-common words are excluded if they are immediately followed by a common word.
- If a non-common word appears at the end of the text and is preceded by a common word, its unigram is not included in the output. ## Example
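The `query_mode` rules above can be illustrated with a short `_analyze` request. The following sketch assumes the `my_common_grams_index` index created in the example on this page, whose `my_analyzer` chains the `lowercase` filter with a `common_grams` filter configured with `common_words: ["a", "in", "for"]` and `query_mode: true`:

```json
GET /my_common_grams_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "dog in the park"
}
```
{% include copy-curl.html %}

Because `in` is a common word, the response should contain the bigrams `dog_in` and `in_the` together with the unigrams `the` and `park`: the unigram `in` is dropped because it is a common word, and the unigram `dog` is dropped because it is immediately followed by a common word. This matches the last four tokens of the full example response on this page.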