Skip to content

Commit

Permalink
adding hunspell token filter #8061
Browse files Browse the repository at this point in the history
Signed-off-by: Anton Rubin <[email protected]>
  • Loading branch information
AntonEliatra committed Aug 22, 2024
1 parent 6eccc88 commit b7e09d5
Show file tree
Hide file tree
Showing 2 changed files with 103 additions and 1 deletion.
102 changes: 102 additions & 0 deletions _analyzers/token-filters/hunspell.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
---
layout: default
title: Hunspell
parent: Token filters
nav_order: 160
---

# Hunspell token filter

The `hunspell` token filter in OpenSearch is used for stemming and morphological analysis of words in a specific language. This filter leverages Hunspell dictionaries, which are widely used in spell checkers. It works by breaking down words into their root forms (stemming).

The Hunspell dictionary files are automatically loaded at startup from `<OS_PATH_CONF>/hunspell/<locale>` directory. For example `en_GB` locale should have at least one `.aff` file and one or more `.dic` files in directory `<OS_PATH_CONF>/hunspell/en_GB/`.

## Parameters

The `hunspell` token filter in OpenSearch can be configured with the following parameters:

- `language/lang/locale`: Specifies the language for the Hunspell dictionary. (String, _Required_ at least one of the three)
- `dictionary`: Configures the dictionary files to be used for Hunspell dictionary. Default is all files in `<OS_PATH_CONF>/hunspell/<locale>` directory. (Array of strings, _Optional_)
- `longest_only`: Specifies if only the longest stemmed version of the token should be returned. Default is `false` (Boolean, _Optional_)

## Example

The following example request creates a new index named `my_index` and configures an analyzer with `hunspell` filter:

```json
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_hunspell_filter": {
"type": "hunspell",
"lang": "en_GB",
"longest_only": true
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_hunspell_filter"
]
}
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the created analyzer:

```json
POST /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "the turtle moves slowly"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "turtle",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "move",
"start_offset": 11,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "slow",
"start_offset": 17,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 3
}
]
}
```
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ Token filter | Underlying Lucene token filter| Description
`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
`hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
Expand Down

0 comments on commit b7e09d5

Please sign in to comment.