Add keyword marker token filter docs #8065 #8134

2 changes: 1 addition & 1 deletion _analyzers/token-filters/elision.md
@@ -24,7 +24,7 @@ Parameter | Required/Optional | Data type | Description

## Example

The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with the `french_elision` filter:
The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with a `french_elision` filter:

```json
PUT /french_texts
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -34,7 +34,7 @@ Token filter | Underlying Lucene token filter| Description
`hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
[`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
`kuromoji_completion` | [JapaneseCompletionFilter](https://lucene.apache.org/core/9_10_0/analysis/kuromoji/org/apache/lucene/analysis/ja/JapaneseCompletionFilter.html) | Adds Japanese romanized terms to the token stream (in addition to the original tokens). Usually used to support autocomplete on Japanese search terms. Note that the filter has a `mode` parameter, which should be set to `index` when used in an index analyzer and `query` when used in a search analyzer. Requires the `analysis-kuromoji` plugin. For information about installing the plugin, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins).
127 changes: 127 additions & 0 deletions _analyzers/token-filters/keyword-marker.md
@@ -0,0 +1,127 @@
---
layout: default
title: Keyword marker
parent: Token filters
nav_order: 200
---

# Keyword marker token filter

The `keyword_marker` token filter prevents specified tokens from being altered by stemmers. It does this by marking those tokens as keywords (setting their `keyword` attribute to `true`), so that stemmers and other keyword-aware filters later in the chain leave them unchanged. This ensures that specific words remain in their original form.

## Parameters

The `keyword_marker` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`ignore_case` | Optional | Boolean | Whether to ignore the letter case when matching keywords. Default is `false`.
`keywords` | Required if neither `keywords_path` nor `keywords_pattern` is set | List of strings | The list of tokens to mark as keywords.
`keywords_path` | Required if neither `keywords` nor `keywords_pattern` is set | String | The path (absolute or relative to the `config` directory) to a file containing the list of keywords.
`keywords_pattern` | Required if neither `keywords` nor `keywords_path` is set | String | A [regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) used for matching tokens to be marked as keywords.
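
As an alternative to an explicit `keywords` list, the filter can be configured with `keywords_pattern`. The following request is a minimal sketch of that configuration: the index name `my_pattern_index`, the analyzer name `pattern_analyzer`, the filter name `pattern_marker_filter`, and the pattern `open.*` are all hypothetical and chosen only for illustration. The intent is that tokens such as `opensearch` are marked as keywords and skipped by the stemmer:

```json
PUT /my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pattern_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "pattern_marker_filter", "stemmer"]
        }
      },
      "filter": {
        "pattern_marker_filter": {
          "type": "keyword_marker",
          "keywords_pattern": "open.*"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Because the `lowercase` filter runs before the marker filter in this sketch, the pattern is written in lowercase.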


## Example

The following example request creates a new index named `my_index` and configures an analyzer with a `keyword_marker` filter. The filter marks the word `example` as a keyword:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "keyword_marker_filter", "stemmer"]
        }
      },
      "filter": {
        "keyword_marker_filter": {
          "type": "keyword_marker",
          "keywords": ["example"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Favorite example"
}
```
{% include copy-curl.html %}

The response contains the generated tokens. Note that the word `favorite` was stemmed to `favorit`, but the word `example` was not stemmed because it was marked as a keyword:

```json
{
  "tokens": [
    {
      "token": "favorit",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "example",
      "start_offset": 9,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```

You can further examine the impact of the `keyword_marker` token filter by adding the `explain` and `attributes` parameters to the `_analyze` request:

```json
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "This is an OpenSearch example demonstrating keyword marker.",
  "explain": true,
  "attributes": "keyword"
}
```
{% include copy-curl.html %}

This will produce additional details in the response similar to the following:

```json
{
  "name": "porter_stem",
  "tokens": [
    ...
    {
      "token": "example",
      "start_offset": 22,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 4,
      "keyword": true
    },
    {
      "token": "demonstr",
      "start_offset": 30,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 5,
      "keyword": false
    },
    ...
  ]
}
```
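
To try out the `ignore_case` parameter without creating an index, one option is to define the filter inline in an `_analyze` request. The following is a sketch under stated assumptions rather than a tested configuration: the keyword `running` and the sample text are hypothetical. There is no `lowercase` filter in this chain, so the capitalized token `Running` can only match the keyword list if `ignore_case` is set to `true`:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": ["running"],
      "ignore_case": true
    },
    "stemmer"
  ],
  "text": "Running runs",
  "explain": true,
  "attributes": "keyword"
}
```
{% include copy-curl.html %}

Comparing the `keyword` attribute in the responses with `ignore_case` set to `true` and to `false` is a quick way to confirm how case matching affects which tokens are protected from stemming.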