Add keyword marker token filter docs #8065
Signed-off-by: Anton Rubin <[email protected]>
AntonEliatra committed Aug 30, 2024
1 parent 6eccc88 commit 1886b93
Showing 2 changed files with 168 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -32,7 +32,7 @@ Token filter | Underlying Lucene token filter | Description
`hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
[`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
`kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins).
167 changes: 167 additions & 0 deletions _analyzers/token-filters/keyword-marker.md
@@ -0,0 +1,167 @@
---
layout: default
title: Keyword marker
parent: Token filters
nav_order: 200
---

# Keyword marker token filter

The `keyword_marker` token filter prevents specified tokens from being altered by stemmers or other filters, ensuring that those words remain in their original form. It does this by marking the matching tokens as keywords, which downstream stemming filters then skip.

## Parameters

The `keyword_marker` token filter in OpenSearch can be configured with the following parameters:

- `ignore_case`: Ignores letter case when matching keywords. Default is `false`. (Boolean, _Optional_)
- `keywords`: A list of strings to match against tokens. (List of strings, _Required_ if neither `keywords_path` nor `keywords_pattern` is set)
- `keywords_path`: The path (relative to the `config` directory or absolute) to a file containing the list of keywords. (String, _Required_ if neither `keywords` nor `keywords_pattern` is set)
- `keywords_pattern`: A [regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used to match tokens. (String, _Required_ if neither `keywords` nor `keywords_path` is set)
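
Instead of listing keywords explicitly, you can match them with a regular expression. The following settings fragment is an illustrative sketch (the filter name `keyword_marker_regex` and the pattern are hypothetical, not part of this change); it marks any token ending in `search` as a keyword and ignores case when matching:

```json
"filter": {
  "keyword_marker_regex": {
    "type": "keyword_marker",
    "keywords_pattern": ".*search",
    "ignore_case": true
  }
}
```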


## Example

The following example request creates a new index named `my_index` and configures an analyzer with a `keyword_marker` filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "keyword_marker_filter", "stemmer"]
        }
      },
      "filter": {
        "keyword_marker_filter": {
          "type": "keyword_marker",
          "keywords": ["OpenSearch", "example"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated by the analyzer:

```json
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "This is an OpenSearch example demonstrating keyword marker."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "thi",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "an",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "opensearch",
      "start_offset": 11,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "example",
      "start_offset": 22,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "demonstr",
      "start_offset": 30,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "keyword",
      "start_offset": 44,
      "end_offset": 51,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "marker",
      "start_offset": 52,
      "end_offset": 58,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
```

You can further examine the effect of the `keyword_marker` token filter by adding the following parameters to the `_analyze` query:

```json
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "This is an OpenSearch example demonstrating keyword marker.",
  "explain": true,
  "attributes": "keyword"
}
```
{% include copy-curl.html %}

This produces additional details in the response, similar to the following:

```json
{
  "name": "porter_stem",
  "tokens": [
    ...
    {
      "token": "example",
      "start_offset": 22,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 4,
      "keyword": true
    },
    {
      "token": "demonstr",
      "start_offset": 30,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 5,
      "keyword": false
    },
    ...
  ]
}
```
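
If the keyword list is long, you can store it in a file instead of providing it inline, using the `keywords_path` parameter. The following settings fragment is an illustrative sketch (the filter name `keyword_marker_from_file` and the file path `analysis/keywords.txt` are hypothetical); the referenced file lives under the node's `config` directory and lists keywords one per line:

```json
"filter": {
  "keyword_marker_from_file": {
    "type": "keyword_marker",
    "keywords_path": "analysis/keywords.txt"
  }
}
```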
