Commit

updating the configs
Signed-off-by: AntonEliatra <[email protected]>
AntonEliatra committed Aug 9, 2024
1 parent b410211 commit 995516c
Showing 1 changed file with 20 additions and 23 deletions.
43 changes: 20 additions & 23 deletions _analyzers/token-filters/cjk-bigram.md
@@ -18,29 +18,26 @@ The `cjk_bigram` token filter can be additionally configured with two parameters

This option allows you to specify whether the filter should ignore certain scripts (like Latin, Cyrillic) and only tokenize CJK text into bigrams. The default is to ignore non-CJK scripts. See following list of possible options:

- `"arab"`: Arabic script
- `"armn"`: Armenian script
- `"beng"`: Bengali script
- `"cyrl"`: Cyrillic script
- `"deva"`: Devanagari script
- `"grek"`: Greek script
- `"gujr"`: Gujarati script
- `"guru"`: Gurmukhi script
- `"hani"`: Han script (used for Chinese characters)
- `"hans"`: Simplified Han script
- `"hant"`: Traditional Han script
- `"hebr"`: Hebrew script
- `"hrkt"`: Hiragana and Katakana scripts
- `"kana"`: Katakana script
- `"hang"`: Hangul script (Korean)
- `"jpan"`: Japanese script (combination of Kanji, Hiragana, Katakana)
- `"knda"`: Kannada script
- `"latn"`: Latin script
- `"mlym"`: Malayalam script
- `"orya"`: Oriya script
- `"taml"`: Tamil script
- `"telg"`: Telugu script
- `"thai"`: Thai script
1. `han` Token Filter

The `han` token filter is used to handle Han characters, which are the logograms used in the written languages of China, Japan, and Korea.
The filter can help in text processing tasks like tokenizing, normalizing, or stemming text written in Chinese, Japanese Kanji, or Korean Hanja.

Check failure on line 24 in _analyzers/token-filters/cjk-bigram.md (GitHub Actions / style-job)

[vale] reported by reviewdog: [OpenSearch.Spelling] Error: Hanja. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks. (severity: ERROR; line 24, column 143)

2. `hangul` Token Filter

The `hangul` token filter is specific to the Hangul script, which is the alphabet used to write the Korean language.
This filter is useful for processing Korean text by handling Hangul syllables, which are unique to Korean and do not exist in other East Asian scripts.

3. `hiragana` Token Filter

The `hiragana` token filter is used for processing Hiragana, one of the two syllabaries used in the Japanese writing system.
Hiragana is typically used for native Japanese words, grammatical elements, and certain forms of punctuation.

4. `katakana` Token Filter

The `katakana` token filter is for Katakana, the other syllabary used in Japanese.
Katakana is mainly used for foreign loanwords, onomatopoeia, scientific names, and certain Japanese words.


### `output_unigrams`
