Add dictionary decompounder docs #7979 #7994

Merged
_analyzers/token-filters/dictionary-decompounder.md (new file, 101 additions)

---
layout: default
title: Dictionary decompounder
parent: Token filters
nav_order: 110
---

# Dictionary decompounder token filter

The `dictionary_decompounder` token filter splits compound words into their constituent parts based on a predefined dictionary. This filter is particularly useful for languages such as German, Dutch, and Finnish, in which compound words are common, so breaking them down can improve search relevance. The filter checks each token against a list of known words; if a token can be split into known words, the filter emits those subwords in addition to the original token.

## Parameters

The `dictionary_decompounder` token filter has the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words.
`word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line.
`min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`.
`min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`.
`max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`.
`only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword will be returned. Default is `false`.
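
If your dictionary is long, you can supply it in a file using `word_list_path` instead of `word_list`. The following request is a minimal sketch of that approach, with the optional parameters set to their default values for illustration. The index name `decompound_file_example`, the filter name `file_dictionary_decompounder`, and the file `analysis/compound_words.txt` are hypothetical; the file must exist under the OpenSearch `config` directory and contain one UTF-8 word per line:

```json
PUT /decompound_file_example
{
  "settings": {
    "analysis": {
      "filter": {
        "file_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/compound_words.txt",
          "min_word_size": 5,
          "min_subword_size": 2,
          "max_subword_size": 15,
          "only_longest_match": false
        }
      }
    }
  }
}
```
{% include copy-curl.html %}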

## Example

The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter:

```json
PUT /decompound_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /decompound_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slowgreenturtleswim"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slowgreenturtleswim",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "green",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
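
Note that every subword inherits the offsets of the original compound token (`start_offset` of `0` and `end_offset` of `19` in this response), and `swim` is not emitted as a subword because it does not appear in the dictionary.
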
_analyzers/token-filters/index.md (1 addition, 1 deletion)

Token filter | Underlying Lucene token filter | Description
:--- | :--- | :---
[`decimal_digit`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/decimal-digit/) | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9).
[`delimited_payload`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-payload/) | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters preceding the delimiter, and a payload consists of all characters following the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload.
[`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency.
[`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Splits compound words into their constituent parts based on a predefined dictionary. Useful for many Germanic languages.
`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.