From 995516c9d8761db9e75b3c4a64bd7cb6313b0f72 Mon Sep 17 00:00:00 2001
From: AntonEliatra
Date: Fri, 9 Aug 2024 09:27:43 +0100
Subject: [PATCH] updating the configs

Signed-off-by: AntonEliatra
---
 _analyzers/token-filters/cjk-bigram.md | 43 ++++++++++++--------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/_analyzers/token-filters/cjk-bigram.md b/_analyzers/token-filters/cjk-bigram.md
index 933e6e55c8..1cf6bf2a05 100644
--- a/_analyzers/token-filters/cjk-bigram.md
+++ b/_analyzers/token-filters/cjk-bigram.md
@@ -18,29 +18,26 @@ The `cjk_bigram` token filter can be additionally configured with two parameters
 
 This option allows you to specify whether the filter should ignore certain scripts (like Latin, Cyrillic) and only tokenize CJK text into bigrams. The default is to ignore non-CJK scripts. See following list of possible options:
 
-- `"arab"`: Arabic script
-- `"armn"`: Armenian script
-- `"beng"`: Bengali script
-- `"cyrl"`: Cyrillic script
-- `"deva"`: Devanagari script
-- `"grek"`: Greek script
-- `"gujr"`: Gujarati script
-- `"guru"`: Gurmukhi script
-- `"hani"`: Han script (used for Chinese characters)
-- `"hans"`: Simplified Han script
-- `"hant"`: Traditional Han script
-- `"hebr"`: Hebrew script
-- `"hrkt"`: Hiragana and Katakana scripts
-- `"kana"`: Katakana script
-- `"hang"`: Hangul script (Korean)
-- `"jpan"`: Japanese script (combination of Kanji, Hiragana, Katakana)
-- `"knda"`: Kannada script
-- `"latn"`: Latin script
-- `"mlym"`: Malayalam script
-- `"orya"`: Oriya script
-- `"taml"`: Tamil script
-- `"telg"`: Telugu script
-- `"thai"`: Thai script
+1. `han` Token Filter
+
+   The `han` token filter is used to handle Han characters, which are the logograms used in the written languages of China, Japan, and Korea.
+   The filter can help in text processing tasks like tokenizing, normalizing, or stemming text written in Chinese, Japanese Kanji, or Korean Hanja.
+
+2. `hangul` Token Filter
+
+   The `hangul` token filter is specific to the Hangul script, which is the alphabet used to write the Korean language.
+   This filter is useful for processing Korean text by handling Hangul syllables, which are unique to Korean and do not exist in other East Asian scripts.
+
+3. `hiragana` Token Filter
+
+   The `hiragana` token filter is used for processing Hiragana, one of the two syllabaries used in the Japanese writing system.
+   Hiragana is typically used for native Japanese words, grammatical elements, and certain forms of punctuation.
+
+4. `katakana` Token Filter
+
+   The `katakana` token filter is for Katakana, the other syllabary used in Japanese.
+   Katakana is mainly used for foreign loanwords, onomatopoeia, scientific names, and certain Japanese words.
+
 
 ### `output_unigrams`