From 1886b9345512b1c4999d3cdffff450c52b4ac2da Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Fri, 30 Aug 2024 11:24:05 +0100 Subject: [PATCH 1/7] Add keyword marker token filter docs #8065 Signed-off-by: Anton Rubin --- _analyzers/token-filters/index.md | 2 +- _analyzers/token-filters/keyword-marker.md | 167 +++++++++++++++++++++ 2 files changed, 168 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/keyword-marker.md diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index f4e9c434e7..f1cb33de74 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -32,7 +32,7 @@ Token filter | Underlying Lucene token filter| Description `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list. `keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type. `keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list. -`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. 
+[`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed. `keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword. `kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary. `kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins). diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md new file mode 100644 index 0000000000..293dfcb8d7 --- /dev/null +++ b/_analyzers/token-filters/keyword-marker.md @@ -0,0 +1,167 @@ +--- +layout: default +title: Keyword marker +parent: Token filters +nav_order: 200 +--- + +# Keyword marker token filter + +The `keyword_marker` token filter in OpenSearch is used to prevent certain tokens from being altered by stemmers or other filters. 
This ensures that specific words remain in their original form. `keyword_marker` token filter does this by marking tokens as `keywords` which prevents any stemming or other processing. + +## Parameters + +The `keyword_marker` token filter in OpenSearch can be configured with the following parameters: + +- `ignore_case`: Ignore the letter case when matching keywords. Default is `false.` (Boolean, _Optional_) +- `keywords`: List of strings used to match tokens. (List of strings, _Required_ if `keywords_path` or `keywords_pattern` is not set) +- `keywords_path`: Path (relative to `config` directory or absolute) to list of strings. (String, _Required_ if `keywords` or `keywords_pattern` is not set) +- `keywords_pattern`: [Regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens. (String, _Required_ if `keywords` or `keywords_path` is not set) + + +## Example + +The following example request creates a new index named `my_index` and configures an analyzer with `keyword_marker` filter: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "custom_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["lowercase", "keyword_marker_filter", "stemmer"] + } + }, + "filter": { + "keyword_marker_filter": { + "type": "keyword_marker", + "keywords": ["OpenSearch", "example"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the created analyzer: + +```json +GET /my_index/_analyze +{ + "analyzer": "custom_analyzer", + "text": "This is an OpenSearch example demonstrating keyword marker." 
+} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "thi", + "start_offset": 0, + "end_offset": 4, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "is", + "start_offset": 5, + "end_offset": 7, + "type": "<ALPHANUM>", + "position": 1 + }, + { + "token": "an", + "start_offset": 8, + "end_offset": 10, + "type": "<ALPHANUM>", + "position": 2 + }, + { + "token": "opensearch", + "start_offset": 11, + "end_offset": 21, + "type": "<ALPHANUM>", + "position": 3 + }, + { + "token": "example", + "start_offset": 22, + "end_offset": 29, + "type": "<ALPHANUM>", + "position": 4 + }, + { + "token": "demonstr", + "start_offset": 30, + "end_offset": 43, + "type": "<ALPHANUM>", + "position": 5 + }, + { + "token": "keyword", + "start_offset": 44, + "end_offset": 51, + "type": "<ALPHANUM>", + "position": 6 + }, + { + "token": "marker", + "start_offset": 52, + "end_offset": 58, + "type": "<ALPHANUM>", + "position": 7 + } + ] +} +``` + +You can further examine the impact of the `keyword_marker` token filter by adding the following parameters to the `_analyze` query: + +```json +GET /my_index/_analyze +{ + "analyzer": "custom_analyzer", + "text": "This is an OpenSearch example demonstrating keyword marker.", + "explain": true, + "attributes": "keyword" +} +``` +{% include copy-curl.html %} + +This will produce additional details in the response similar to the following: + +```json +{ + "name": "porter_stem", + "tokens": [ + ... + { + "token": "example", + "start_offset": 22, + "end_offset": 29, + "type": "<ALPHANUM>", + "position": 4, + "keyword": true + }, + { + "token": "demonstr", + "start_offset": 30, + "end_offset": 43, + "type": "<ALPHANUM>", + "position": 5, + "keyword": false + }, + ... 
+ ] +} +``` \ No newline at end of file From 910f4b3b8c8eface588b657eb2a18c13b78dcd61 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 12 Sep 2024 11:05:14 +0100 Subject: [PATCH 2/7] Update keyword-marker.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/keyword-marker.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md index 293dfcb8d7..2a2b03f11f 100644 --- a/_analyzers/token-filters/keyword-marker.md +++ b/_analyzers/token-filters/keyword-marker.md @@ -49,7 +49,7 @@ PUT /my_index ## Generated tokens -Use the following request to examine the tokens generated using the created analyzer: +Use the following request to examine the tokens generated using the analyzer: ```json GET /my_index/_analyze @@ -164,4 +164,4 @@ This will produce additional details in the response similar to the following: ... ] } -``` \ No newline at end of file +``` From 58eee0d008f41b4c89fdf914d6b21b4c2e37a4fc Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Wed, 16 Oct 2024 18:52:11 +0100 Subject: [PATCH 3/7] updating parameter table Signed-off-by: Anton Rubin --- _analyzers/token-filters/keyword-marker.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md index 2a2b03f11f..c0213736ff 100644 --- a/_analyzers/token-filters/keyword-marker.md +++ b/_analyzers/token-filters/keyword-marker.md @@ -11,12 +11,14 @@ The `keyword_marker` token filter in OpenSearch is used to prevent certain token ## Parameters -The `keyword_marker` token filter in OpenSearch can be configured with the following parameters: - -- `ignore_case`: Ignore the letter case when matching keywords. Default is `false.` (Boolean, _Optional_) -- `keywords`: List of strings used to match tokens. 
(List of strings, _Required_ if `keywords_path` or `keywords_pattern` is not set) -- `keywords_path`: Path (relative to `config` directory or absolute) to list of strings. (String, _Required_ if `keywords` or `keywords_pattern` is not set) -- `keywords_pattern`: [Regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens. (String, _Required_ if `keywords` or `keywords_path` is not set) +The `keyword_marker` token filter in OpenSearch can be configured with the following parameters. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`ignore_case` | Optional | Boolean | Ignore the letter case when matching keywords. Default is `false.` +`keywords` | Required if `keywords_path` or `keywords_pattern` is not set | List of strings | List of strings used to match tokens. +`keywords_path` | Required if `keywords` or `keywords_pattern` is not set | String | Path (relative to `config` directory or absolute) to list of strings. +`keywords_pattern` | Required if `keywords` or `keywords_path` is not set | String | [Regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens. 
## Example From ade879d2ba736778c22a394a3ee24f1452575b7d Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Tue, 12 Nov 2024 13:56:49 +0000 Subject: [PATCH 4/7] Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra --- _analyzers/token-filters/keyword-marker.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md index c0213736ff..f19c660596 100644 --- a/_analyzers/token-filters/keyword-marker.md +++ b/_analyzers/token-filters/keyword-marker.md @@ -7,18 +7,18 @@ nav_order: 200 # Keyword marker token filter -The `keyword_marker` token filter in OpenSearch is used to prevent certain tokens from being altered by stemmers or other filters. This ensures that specific words remain in their original form. `keyword_marker` token filter does this by marking tokens as `keywords` which prevents any stemming or other processing. +The `keyword_marker` token filter is used to prevent certain tokens from being altered by stemmers or other filters. The `keyword_marker` token filter does this by marking the specified tokens as `keywords`, which prevents any stemming or other processing. This ensures that specific words remain in their original form. ## Parameters -The `keyword_marker` token filter in OpenSearch can be configured with the following parameters. +The `keyword_marker` token filter can be configured with the following parameters. Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- -`ignore_case` | Optional | Boolean | Ignore the letter case when matching keywords. Default is `false.` -`keywords` | Required if `keywords_path` or `keywords_pattern` is not set | List of strings | List of strings used to match tokens. 
-`keywords_path` | Required if `keywords` or `keywords_pattern` is not set | String | Path (relative to `config` directory or absolute) to list of strings. -`keywords_pattern` | Required if `keywords` or `keywords_path` is not set | String | [Regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens. +`ignore_case` | Optional | Boolean | Whether to ignore the letter case when matching keywords. Default is `false`. +`keywords` | Required if `keywords_path` or `keywords_pattern` is not set | List of strings | List of tokens to mark as keywords. +`keywords_path` | Required if `keywords` or `keywords_pattern` is not set | String | Path (relative to the `config` directory or absolute) to the list of keywords. +`keywords_pattern` | Required if `keywords` or `keywords_path` is not set | String | [Regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens to be marked as keywords. ## Example From b8175e1fe4b476d616b054b6d32fe7e9becb15d9 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Thu, 14 Nov 2024 15:40:13 -0500 Subject: [PATCH 5/7] Change example Signed-off-by: Fanit Kolchina --- _analyzers/token-filters/keyword-marker.md | 60 ++++------------------ 1 file changed, 9 insertions(+), 51 deletions(-) diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md index f19c660596..7246b59895 100644 --- a/_analyzers/token-filters/keyword-marker.md +++ b/_analyzers/token-filters/keyword-marker.md @@ -23,7 +23,7 @@ Parameter | Required/Optional | Data type | Description ## Example -The following example request creates a new index named `my_index` and configures an analyzer with `keyword_marker` filter: +The following example request creates a new index named `my_index` and configures an analyzer with `keyword_marker` filter. 
The filter marks the word `example` as a keyword: ```json PUT /my_index @@ -40,7 +40,7 @@ PUT /my_index "filter": { "keyword_marker_filter": { "type": "keyword_marker", - "keywords": ["OpenSearch", "example"] + "keywords": ["example"] } } } @@ -57,71 +57,29 @@ Use the following request to examine the tokens generated using the analyzer: GET /my_index/_analyze { "analyzer": "custom_analyzer", - "text": "This is an OpenSearch example demonstrating keyword marker." + "text": "Favorite example" } ``` {% include copy-curl.html %} -The response contains the generated tokens: +The response contains the generated tokens. Note that the word `favorite` was stemmed but the word `example` was not stemmed because it was marked as a keyword: ```json { "tokens": [ { - "token": "thi", + "token": "favorit", "start_offset": 0, - "end_offset": 4, + "end_offset": 8, "type": "<ALPHANUM>", "position": 0 }, - { - "token": "is", - "start_offset": 5, - "end_offset": 7, - "type": "<ALPHANUM>", - "position": 1 - }, - { - "token": "an", - "start_offset": 8, - "end_offset": 10, - "type": "<ALPHANUM>", - "position": 2 - }, - { - "token": "opensearch", - "start_offset": 11, - "end_offset": 21, - "type": "<ALPHANUM>", - "position": 3 - }, { "token": "example", - "start_offset": 22, - "end_offset": 29, + "start_offset": 9, + "end_offset": 16, "type": "<ALPHANUM>", - "position": 4 - }, - { - "token": "demonstr", - "start_offset": 30, - "end_offset": 43, - "type": "<ALPHANUM>", - "position": 5 - }, - { - "token": "keyword", - "start_offset": 44, - "end_offset": 51, - "type": "<ALPHANUM>", - "position": 6 - }, - { - "token": "marker", - "start_offset": 52, - "end_offset": 58, - "type": "<ALPHANUM>", - "position": 7 + "position": 1 } ] } From fb473df300c650b69180315956067cf15de5d872 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Thu, 14 Nov 2024 15:50:37 -0500 Subject: [PATCH 6/7] Add article to elision token filter Signed-off-by: Fanit Kolchina --- _analyzers/token-filters/elision.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git 
a/_analyzers/token-filters/elision.md b/_analyzers/token-filters/elision.md index b5dd5134b6..abc6dba658 100644 --- a/_analyzers/token-filters/elision.md +++ b/_analyzers/token-filters/elision.md @@ -24,7 +24,7 @@ Parameter | Required/Optional | Data type | Description ## Example -The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with the `french_elision` filter: +The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with a `french_elision` filter: ```json PUT /french_texts From fd0f8671e76076f657a8e1a23376fb106f92ce54 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Thu, 14 Nov 2024 15:59:06 -0500 Subject: [PATCH 7/7] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/token-filters/keyword-marker.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_analyzers/token-filters/keyword-marker.md b/_analyzers/token-filters/keyword-marker.md index 7246b59895..0ec2cb96f5 100644 --- a/_analyzers/token-filters/keyword-marker.md +++ b/_analyzers/token-filters/keyword-marker.md @@ -16,14 +16,14 @@ The `keyword_marker` token filter can be configured with the following parameter Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- `ignore_case` | Optional | Boolean | Whether to ignore the letter case when matching keywords. Default is `false`. 
-`keywords` | Required if `keywords_path` or `keywords_pattern` is not set | List of strings | List of tokens to mark as keywords. -`keywords_path` | Required if `keywords` or `keywords_pattern` is not set | String | Path (relative to the `config` directory or absolute) to the list of keywords. -`keywords_pattern` | Required if `keywords` or `keywords_path` is not set | String | [Regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens to be marked as keywords. +`keywords` | Required if either `keywords_path` or `keywords_pattern` is not set | List of strings | The list of tokens to mark as keywords. +`keywords_path` | Required if either `keywords` or `keywords_pattern` is not set | String | The path (relative to the `config` directory or absolute) to the list of keywords. +`keywords_pattern` | Required if either `keywords` or `keywords_path` is not set | String | A [regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens to be marked as keywords. ## Example -The following example request creates a new index named `my_index` and configures an analyzer with `keyword_marker` filter. The filter marks the word `example` as a keyword: +The following example request creates a new index named `my_index` and configures an analyzer with a `keyword_marker` filter. The filter marks the word `example` as a keyword: ```json PUT /my_index @@ -62,7 +62,7 @@ GET /my_index/_analyze ``` {% include copy-curl.html %} -The response contains the generated tokens. Note that the word `favorite` was stemmed but the word `example` was not stemmed because it was marked as a keyword: +The response contains the generated tokens. Note that while the word `favorite` was stemmed, the word `example` was not stemmed because it was marked as a keyword: ```json {