From a213742c0485e7769cfffd27b80e7e6cb401da82 Mon Sep 17 00:00:00 2001 From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Date: Fri, 3 Jan 2025 18:06:58 +0000 Subject: [PATCH 01/10] [DOC] Tokenizer - Keyword (#8396) * keyword tokenizer Signed-off-by: leanne.laceybyrne@eliatra.com * review comments ammended for page layout Signed-off-by: leanne.laceybyrne@eliatra.com * adding an intro before example Signed-off-by: leanne.laceybyrne@eliatra.com * Update keyword-tokenizers.md Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> * Doc review Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: leanne.laceybyrne@eliatra.com Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/tokenizers/keyword.md | 119 +++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 _analyzers/tokenizers/keyword.md diff --git a/_analyzers/tokenizers/keyword.md b/_analyzers/tokenizers/keyword.md new file mode 100644 index 0000000000..8b77d38ca5 --- /dev/null +++ b/_analyzers/tokenizers/keyword.md @@ -0,0 +1,119 @@ +--- +layout: default +title: Keyword +parent: Tokenizers +nav_order: 50 +--- + +# Keyword tokenizer + +The `keyword` tokenizer ingests text and outputs it exactly as a single, unaltered token. This makes it particularly useful when you want the input to remain intact, such as when managing structured data like names, product codes, or email addresses. + +The `keyword` tokenizer can be paired with token filters to process the text, for example, to normalize it or to remove extraneous characters. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `keyword` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_keyword_analyzer": { + "type": "custom", + "tokenizer": "keyword" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_keyword_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_keyword_analyzer", + "text": "OpenSearch Example" +} +``` +{% include copy-curl.html %} + +The response contains the single token representing the original text: + +```json +{ + "tokens": [ + { + "token": "OpenSearch Example", + "start_offset": 0, + "end_offset": 18, + "type": "word", + "position": 0 + } + ] +} +``` + +## Parameters + +The `keyword` token filter can be configured with the following parameter. + +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`buffer_size`| Optional | Integer | Determines the character buffer size. Default is `256`. There is usually no need to change this setting. + +## Combining the keyword tokenizer with token filters + +To enhance the functionality of the `keyword` tokenizer, you can combine it with token filters. Token filters can transform the text, such as converting it to lowercase or removing unwanted characters. 
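+
+### Example: Using the lowercase filter and keyword tokenizer
+
+For instance, a request similar to the following pairs the `keyword` tokenizer with the built-in `lowercase` token filter. The input is kept as a single token but is normalized to lowercase:
+
+```json
+POST _analyze
+{
+  "tokenizer": "keyword",
+  "filter": [ "lowercase" ],
+  "text": "OpenSearch Example"
+}
+```
+{% include copy-curl.html %}
+
+The response should contain a single token, such as `opensearch example` for the input text `OpenSearch Example`.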
+ +### Example: Using the pattern_replace filter and keyword tokenizer + +In this example, the `pattern_replace` filter uses a regular expression to replace all non-alphanumeric characters with an empty string: + +```json +POST _analyze +{ + "tokenizer": "keyword", + "filter": [ + { + "type": "pattern_replace", + "pattern": "[^a-zA-Z0-9]", + "replacement": "" + } + ], + "text": "Product#1234-XYZ" +} +``` +{% include copy-curl.html %} + +The `pattern_replace` filter removes non-alphanumeric characters and returns the following token: + +```json +{ + "tokens": [ + { + "token": "Product1234XYZ", + "start_offset": 0, + "end_offset": 16, + "type": "word", + "position": 0 + } + ] +} +``` + From a320d05d02ae82972e7fa9a5332ff765942dd2bc Mon Sep 17 00:00:00 2001 From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Date: Fri, 3 Jan 2025 19:42:06 +0000 Subject: [PATCH 02/10] [DOC] Character filters - Mapping (#8556) * doc: Adding mapping character filter page and update to title on HTML strip Signed-off-by: leanne.laceybyrne@eliatra.com * Doc review Signed-off-by: Fanit Kolchina * Update _analyzers/character-filters/mapping-character-filter.md Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: leanne.laceybyrne@eliatra.com Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Melissa Vagi Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../html-character-filter.md | 78 +++++++++-- .../mapping-character-filter.md | 124 ++++++++++++++++++ 2 files changed, 190 insertions(+), 12 deletions(-) create mode 100644 _analyzers/character-filters/mapping-character-filter.md diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md index ef55930bdf..eee548d0f7 100644 --- a/_analyzers/character-filters/html-character-filter.md +++ b/_analyzers/character-filters/html-character-filter.md @@ -11,6 +11,8 @@ The `html_strip` character filter removes HTML tags, such as `
<br>`, `<p>
`, and ## Example: HTML analyzer +The following request applies an `html_strip` character filter to the provided text: + ```json GET /_analyze { @@ -23,15 +25,35 @@ GET /_analyze ``` {% include copy-curl.html %} -Using the HTML analyzer, you can convert the HTML character entity references into their corresponding symbols. The processed text would read as follows: +The response contains the token in which HTML characters have been converted to their decoded values: -``` +```json +{ + "tokens": [ + { + "token": """ Commonly used calculus symbols include α, β and θ +""", + "start_offset": 0, + "end_offset": 74, + "type": "word", + "position": 0 + } + ] +} ``` +## Parameters + +The `html_strip` character filter can be configured with the following parameter. + +| Parameter | Required/Optional | Data type | Description | +|:---|:---|:---|:---| +| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `` and `` elements from being stripped.| + ## Example: Custom analyzer with lowercase filter -The following example query creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter: +The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter: ```json PUT /html_strip_and_lowercase_analyzer @@ -57,9 +79,7 @@ PUT /html_strip_and_lowercase_analyzer ``` {% include copy-curl.html %} -### Testing `html_strip_and_lowercase_analyzer` - -You can run the following request to test the analyzer: +Use the following request to examine the tokens generated using the analyzer: ```json GET /html_strip_and_lowercase_analyzer/_analyze @@ -72,8 +92,32 @@ GET /html_strip_and_lowercase_analyzer/_analyze In the response, the HTML tags have been removed and the plain text has been converted to lowercase: -``` -welcome to opensearch! +```json +{ + "tokens": [ + { + "token": "welcome", + "start_offset": 4, + "end_offset": 11, + "type": "", + "position": 0 + }, + { + "token": "to", + "start_offset": 12, + "end_offset": 14, + "type": "", + "position": 1 + }, + { + "token": "opensearch", + "start_offset": 23, + "end_offset": 42, + "type": "", + "position": 2 + } + ] +} ``` ## Example: Custom analyzer that preserves HTML tags @@ -104,9 +148,7 @@ PUT /html_strip_preserve_analyzer ``` {% include copy-curl.html %} -### Testing `html_strip_preserve_analyzer` - -You can run the following request to test the analyzer: +Use the following request to examine the tokens generated using the analyzer: ```json GET /html_strip_preserve_analyzer/_analyze @@ -119,6 +161,18 @@ GET /html_strip_preserve_analyzer/_analyze In the response, the `italic` and `bold` tags have been retained, as specified in the custom analyzer request: -``` +```json +{ + "tokens": [ + { + "token": """ This is a bold and italic text. 
+""", + "start_offset": 0, + "end_offset": 52, + "type": "word", + "position": 0 + } + ] +} ``` diff --git a/_analyzers/character-filters/mapping-character-filter.md b/_analyzers/character-filters/mapping-character-filter.md new file mode 100644 index 0000000000..0cd882e52e --- /dev/null +++ b/_analyzers/character-filters/mapping-character-filter.md @@ -0,0 +1,124 @@ +--- +layout: default +title: Mapping +parent: Character filters +nav_order: 120 +--- + +# Mapping character filter + +The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings. + +The filter applies greedy matching, meaning that the longest matching pattern is matched. + +The `mapping` character filter helps in scenarios where specific text replacements are required before tokenization. + +## Example + +The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3): + +```json +GET /_analyze +{ + "tokenizer": "keyword", + "char_filter": [ + { + "type": "mapping", + "mappings": [ + "I => 1", + "II => 2", + "III => 3", + "IV => 4", + "V => 5" + ] + } + ], + "text": "I have III apples and IV oranges" +} +``` + +The response contains a token where Roman numerals have been replaced with Arabic numerals: + +```json +{ + "tokens": [ + { + "token": "1 have 3 apples and 4 oranges", + "start_offset": 0, + "end_offset": 32, + "type": "word", + "position": 0 + } + ] +} +``` +{% include copy-curl.html %} + +## Parameters + +You can use either of the following parameters to configure the key-value map. + +| Parameter | Required/Optional | Data type | Description | +|:---|:---|:---|:---| +| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. | +| `mappings_path` | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. | + +### Using a custom mapping character filter + +You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in a text: + +```json +PUT /test-index +{ + "settings": { + "analysis": { + "analyzer": { + "custom_abbr_analyzer": { + "tokenizer": "standard", + "char_filter": [ + "custom_abbr_filter" + ] + } + }, + "char_filter": { + "custom_abbr_filter": { + "type": "mapping", + "mappings": [ + "BTW => By the way", + "IDK => I don't know", + "FYI => For your information" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /text-index/_analyze +{ + "tokenizer": "keyword", + "char_filter": [ "custom_abbr_filter" ], + "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday." +} +``` + +The response shows that the abbreviations were replaced: + +```json +{ + "tokens": [ + { + "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. 
By the way, the finalized schedule will be released Monday.", + "start_offset": 0, + "end_offset": 153, + "type": "word", + "position": 0 + } + ] +} +``` From a3af660ca885e09441453979a1d06867464af7c2 Mon Sep 17 00:00:00 2001 From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Date: Fri, 3 Jan 2025 19:42:49 +0000 Subject: [PATCH 03/10] [DOC] Tokenizer - Letter (#8498) * adding page letter tokenizer Signed-off-by: leanne.laceybyrne@eliatra.com * Doc review Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: leanne.laceybyrne@eliatra.com Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- _analyzers/tokenizers/letter.md | 97 +++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) create mode 100644 _analyzers/tokenizers/letter.md diff --git a/_analyzers/tokenizers/letter.md b/_analyzers/tokenizers/letter.md new file mode 100644 index 0000000000..ba67a7841d --- /dev/null +++ b/_analyzers/tokenizers/letter.md @@ -0,0 +1,97 @@ +--- +layout: default +title: Letter +parent: Tokenizers +nav_order: 60 +--- + +# Letter tokenizer + +The `letter` tokenizer splits text into words on any non-letter characters. It works well with many European languages but is ineffective with some Asian languages in which words aren't separated by spaces. + +## Example usage + +The following example request creates a new index named `my_index` and configures an analyzer with a `letter` tokenizer: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_letter_analyzer": { + "type": "custom", + "tokenizer": "letter" + } + } + } + }, + "mappings": { + "properties": { + "content": { + "type": "text", + "analyzer": "my_letter_analyzer" + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST _analyze +{ + "tokenizer": "letter", + "text": "Cats 4EVER love chasing butterflies!" 
+} + +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Cats", + "start_offset": 0, + "end_offset": 4, + "type": "word", + "position": 0 + }, + { + "token": "EVER", + "start_offset": 6, + "end_offset": 10, + "type": "word", + "position": 1 + }, + { + "token": "love", + "start_offset": 11, + "end_offset": 15, + "type": "word", + "position": 2 + }, + { + "token": "chasing", + "start_offset": 16, + "end_offset": 23, + "type": "word", + "position": 3 + }, + { + "token": "butterflies", + "start_offset": 24, + "end_offset": 35, + "type": "word", + "position": 4 + } + ] +} +``` From 7a5ba5c693705d5bcf2b8a295ce06666c9ff8abe Mon Sep 17 00:00:00 2001 From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Date: Mon, 6 Jan 2025 19:38:15 +0000 Subject: [PATCH 04/10] [DOC] Character filters - Pattern replace (#8557) * doc: addition of pattern replace charachter filter page Signed-off-by: leanne.laceybyrne@eliatra.com * Doc review Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: leanne.laceybyrne@eliatra.com Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../pattern-replace-character-filter.md | 238 ++++++++++++++++++ 1 file changed, 238 insertions(+) create mode 100644 _analyzers/character-filters/pattern-replace-character-filter.md diff --git a/_analyzers/character-filters/pattern-replace-character-filter.md b/_analyzers/character-filters/pattern-replace-character-filter.md new file mode 100644 index 0000000000..87cc93e904 --- /dev/null +++ b/_analyzers/character-filters/pattern-replace-character-filter.md @@ -0,0 +1,238 @@ +--- +layout: default +title: Pattern replace +parent: Character filters +nav_order: 130 +--- + +# Pattern replace character filter + +The `pattern_replace` character filter allows you to use regular expressions to define patterns for matching and replacing characters in the input text. It is a flexible tool for advanced text transformations, especially when dealing with complex string patterns. + +This filter replaces all instances of a pattern with a specified replacement string, allowing for easy substitutions, deletions, or complex modifications of the input text. You can use it to normalize the input before tokenization. + +## Example + +To standardize phone numbers, you'll use the regular expression `[\\s()-]+`: + +- `[ ]`: Defines a **character class**, meaning it will match **any one** of the characters inside the brackets. +- `\\s`: Matches any **white space** character, such as a space, tab, or newline. +- `()`: Matches literal **parentheses** (`(` or `)`). +- `-`: Matches a literal **hyphen** (`-`). +- `+`: Specifies that the pattern should match **one or more** occurrences of the preceding characters. + +The pattern `[\\s()-]+` will match any sequence of one or more white space characters, parentheses, or hyphens and remove it from the input text. This ensures that the phone numbers are normalized and contain only digits. 
+ +The following request standardizes phone numbers by removing spaces, dashes, and parentheses: + +```json +GET /_analyze +{ + "tokenizer": "standard", + "char_filter": [ + { + "type": "pattern_replace", + "pattern": "[\\s()-]+", + "replacement": "" + } + ], + "text": "(555) 123-4567" +} +``` +{% include copy-curl.html %} + +The response contains the generated token: + +```json +{ + "tokens": [ + { + "token": "5551234567", + "start_offset": 1, + "end_offset": 14, + "type": "", + "position": 0 + } + ] +} +``` + +## Parameters + +The `pattern_replace` character filter must be configured with the following parameters. + +| Parameter | Required/Optional | Data type | Description | +|:---|:---| +| `pattern` | Required | String | A regular expression used to match parts of the input text. The filter identifies and matches this pattern to perform replacement. | +| `replacement` | Optional | String | The string that replaces pattern matches. Use an empty string (`""`) to remove the matched text. Default is an empty string (`""`). | + +## Creating a custom analyzer + +The following request creates an index with a custom analyzer configured with a `pattern_replace` character filter. The filter removes currency signs and thousands separators (both European `.` and American `,`) from numbers: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_analyzer": { + "tokenizer": "standard", + "char_filter": [ + "pattern_char_filter" + ] + } + }, + "char_filter": { + "pattern_char_filter": { + "type": "pattern_replace", + "pattern": "[$€,.]", + "replacement": "" + } + } + } + } +} +``` + +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_analyzer", + "text": "Total: $ 1,200.50 and € 1.100,75" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Total", + "start_offset": 0, + "end_offset": 5, + "type": "", + "position": 0 + }, + { + "token": "120050", + "start_offset": 9, + "end_offset": 17, + "type": "", + "position": 1 + }, + { + "token": "and", + "start_offset": 18, + "end_offset": 21, + "type": "", + "position": 2 + }, + { + "token": "110075", + "start_offset": 24, + "end_offset": 32, + "type": "", + "position": 3 + } + ] +} +``` + +## Using capturing groups + +You can use capturing groups in the `replacement` parameter. For example, the following request creates a custom analyzer that uses a `pattern_replace` character filter to replace hyphens with dots in phone numbers: + +```json +PUT /my_index +{ + "settings": { + "analysis": { + "analyzer": { + "my_analyzer": { + "tokenizer": "standard", + "char_filter": [ + "pattern_char_filter" + ] + } + }, + "char_filter": { + "pattern_char_filter": { + "type": "pattern_replace", + "pattern": "(\\d+)-(?=\\d)", + "replacement": "$1." 
+ } + } + } + } +} +``` +{% include copy-curl.html %} + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /my_index/_analyze +{ + "analyzer": "my_analyzer", + "text": "Call me at 555-123-4567 or 555-987-6543" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Call", + "start_offset": 0, + "end_offset": 4, + "type": "", + "position": 0 + }, + { + "token": "me", + "start_offset": 5, + "end_offset": 7, + "type": "", + "position": 1 + }, + { + "token": "at", + "start_offset": 8, + "end_offset": 10, + "type": "", + "position": 2 + }, + { + "token": "555.123.4567", + "start_offset": 11, + "end_offset": 23, + "type": "", + "position": 3 + }, + { + "token": "or", + "start_offset": 24, + "end_offset": 26, + "type": "", + "position": 4 + }, + { + "token": "555.987.6543", + "start_offset": 27, + "end_offset": 39, + "type": "", + "position": 5 + } + ] +} +``` \ No newline at end of file From a66d54e77f1cecdf4ac0f31343763b35787a8276 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Mon, 6 Jan 2025 15:04:12 -0500 Subject: [PATCH 05/10] Add links and refactor token and character filter section (#9018) * Add links and refactor token and character filter section Signed-off-by: Fanit Kolchina * Add last link Signed-off-by: Fanit Kolchina --------- Signed-off-by: Fanit Kolchina --- .../html-character-filter.md | 2 +- _analyzers/character-filters/index.md | 6 ++-- .../mapping-character-filter.md | 3 +- _analyzers/tokenizers/index.md | 30 +++++++++---------- 4 files changed, 21 insertions(+), 20 deletions(-) diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md index eee548d0f7..bd9f88583e 100644 --- a/_analyzers/character-filters/html-character-filter.md +++ b/_analyzers/character-filters/html-character-filter.md @@ -9,7 +9,7 @@ nav_order: 100 The `html_strip` character filter removes HTML tags, such as `
<br>`, `<p>
`, and ``, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as ` `, into spaces. -## Example: HTML analyzer +## Example The following request applies an `html_strip` character filter to the provided text: diff --git a/_analyzers/character-filters/index.md b/_analyzers/character-filters/index.md index 0e2ce01b8c..9d4980ac80 100644 --- a/_analyzers/character-filters/index.md +++ b/_analyzers/character-filters/index.md @@ -14,6 +14,6 @@ Unlike token filters, which operate on tokens (words or terms), character filter Use cases for character filters include: -- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed. -- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces. -- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents. +- **HTML stripping**: The [`html_strip`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/html-character-filter/) character filter removes HTML tags from content so that only the plain text is indexed. +- **Pattern replacement**: The [`pattern_replace`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/pattern-replace-character-filter/) character filter replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces. +- **Custom mappings**: The [`mapping`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/mapping-character-filter/) character filter substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents. diff --git a/_analyzers/character-filters/mapping-character-filter.md b/_analyzers/character-filters/mapping-character-filter.md index 0cd882e52e..59e516e4ec 100644 --- a/_analyzers/character-filters/mapping-character-filter.md +++ b/_analyzers/character-filters/mapping-character-filter.md @@ -36,6 +36,7 @@ GET /_analyze "text": "I have III apples and IV oranges" } ``` +{% include copy-curl.html %} The response contains a token where Roman numerals have been replaced with Arabic numerals: @@ -52,7 +53,6 @@ The response contains a token where Roman numerals have been replaced with Arabi ] } ``` -{% include copy-curl.html %} ## Parameters @@ -106,6 +106,7 @@ GET /text-index/_analyze "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday." } ``` +{% include copy-curl.html %} The response shows that the abbreviations were replaced: diff --git a/_analyzers/tokenizers/index.md b/_analyzers/tokenizers/index.md index f5b5ff0f25..cef1429778 100644 --- a/_analyzers/tokenizers/index.md +++ b/_analyzers/tokenizers/index.md @@ -30,13 +30,13 @@ Word tokenizers parse full text into words. Tokenizer | Description | Example :--- | :--- | :--- -`standard` | - Parses strings into tokens at word boundaries
- Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`] -`letter` | - Parses strings into tokens on any non-letter character
- Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `to`, `OpenSearch`] -`lowercase` | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] -`whitespace` | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] -`uax_url_email` | - Similar to the standard tokenizer
- Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `opensearch-project@github.com`] -`classic` | - Parses strings into tokens on:
  - Punctuation characters that are followed by a white space character
  - Hyphens if the term does not contain numbers
- Removes punctuation
- Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)`
becomes
[`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`] -`thai` | - Parses Thai text into terms | `สวัสดีและยินดีต`
becomes
[`สวัสด`, `และ`, `ยินดี`, `ต`] +[`standard`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/standard/) | - Parses strings into tokens at word boundaries
- Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`] +[`letter`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/letter/) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `to`, `OpenSearch`] +[`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/lowercase/) | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] +[`whitespace`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/whitespace/) | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] +[`uax_url_email`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/uax-url-email/) | - Similar to the standard tokenizer
- Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `opensearch-project@github.com`] +[`classic`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/classic/) | - Parses strings into tokens on:
  - Punctuation characters that are followed by a white space character
  - Hyphens if the term does not contain numbers
- Removes punctuation
- Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)`
becomes
[`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`] +[`thai`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/thai/) | - Parses Thai text into terms | `สวัสดีและยินดีต`
becomes
[`สวัสด`, `และ`, `ยินดี`, `ต`] ### Partial word tokenizers @@ -44,8 +44,8 @@ Partial word tokenizers parse text into words and generate fragments of those wo Tokenizer | Description | Example :--- | :--- | :--- -`ngram`| - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo`
becomes
[`M`, `My`, `y`, `y `,  ,  r, `r`, `re`, `e`, `ep`, `p`, `po`, `o`]
because the default n-gram length is 1--2 characters -`edge_ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo`
becomes
[`M`, `My`]
because the default n-gram length is 1--2 characters +[`ngram`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/ngram/)| - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo`
becomes
[`M`, `My`, `y`, `y `,  ,  r, `r`, `re`, `e`, `ep`, `p`, `po`, `o`]
because the default n-gram length is 1--2 characters +[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/edge-n-gram/) | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo`
becomes
[`M`, `My`]
because the default n-gram length is 1--2 characters ### Structured text tokenizers @@ -53,11 +53,11 @@ Structured text tokenizers parse structured text, such as identifiers, email add Tokenizer | Description | Example :--- | :--- | :--- -`keyword` | - No-op tokenizer
- Outputs the entire string unchanged
- Can be combined with token filters, like lowercase, to normalize terms | `My repo`
becomes
`My repo` -`pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms
- Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum`
becomes
[`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)
Can be configured with a regex pattern -`simple_pattern` | - Uses a regular expression pattern to return matching text as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default
Must be configured with a pattern because the pattern defaults to an empty string -`simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default
Must be configured with a pattern -`char_group` | - Parses on a set of configurable characters
- Faster than tokenizers that run regular expressions | No-op by default
Must be configured with a list of characters -`path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three`
becomes
[`one`, `one/two`, `one/two/three`] +[`keyword`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/keyword/) | - No-op tokenizer
- Outputs the entire string unchanged
- Can be combined with token filters, like lowercase, to normalize terms | `My repo`
becomes
`My repo` +[`pattern`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/pattern/) | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms
- Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum`
becomes
[`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)
Can be configured with a regex pattern +[`simple_pattern`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/simple-pattern/) | - Uses a regular expression pattern to return matching text as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default
Must be configured with a pattern because the pattern defaults to an empty string +[`simple_pattern_split`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/simple-pattern-split/) | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default
Must be configured with a pattern +[`char_group`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/character-group/) | - Parses on a set of configurable characters
- Faster than tokenizers that run regular expressions | No-op by default
Must be configured with a list of characters +[`path_hierarchy`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/path-hierarchy/) | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three`
becomes
[`one`, `one/two`, `one/two/three`] From ad8781a661f850891faa44c53b6dab0eb20c205e Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Mon, 6 Jan 2025 15:56:18 -0600 Subject: [PATCH 06/10] Fix index mapping parameter option (#9023) Signed-off-by: Archer --- _field-types/mapping-parameters/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_field-types/mapping-parameters/index.md b/_field-types/mapping-parameters/index.md index ca5586bb8f..a9b60c03d4 100644 --- a/_field-types/mapping-parameters/index.md +++ b/_field-types/mapping-parameters/index.md @@ -24,5 +24,5 @@ Parameter | Description `format` | Specifies the date format for date fields. There is no default value for this parameter. Allowed values are any valid date format string, such as `yyyy-MM-dd` or `epoch_millis`. `ignore_above` | Skips indexing values that exceed the specified length. Default value is `2147483647`, which means that there is no limit on the field value length. Allowed values are any positive integer. `ignore_malformed` | Specifies whether malformed values should be ignored. Default value is `false`, which means that malformed values are not ignored. Allowed values are `true` or `false`. -`index` | Specifies whether a field should be indexed. Default value is `true`, which means that the field is indexed. Allowed values are `true`, `false`, or `not_analyzed`. +`index` | Specifies whether a field should be indexed. Default value is `true`, which means that the field is indexed. Allowed values are `true` or `false`. `index_options` | Specifies what information should be stored in an index for scoring purposes. Default value is `docs`, which means that only the document numbers are stored in the index. Allowed values are `docs`, `freqs`, `positions`, or `offsets`. \ No newline at end of file From 984e2c3bc149f60d29f735ff29026577df57c47f Mon Sep 17 00:00:00 2001 From: Andre Kurait Date: Tue, 7 Jan 2025 12:55:33 -0600 Subject: [PATCH 07/10] Add supported regions for Migration Assistant (#9017) * Add supported regions for Migration Assistant Signed-off-by: Andre Kurait * Update is-migration-assistant-right-for-you.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Andre Kurait Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- .../is-migration-assistant-right-for-you.md | 29 +++++++++++++++---- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/_migration-assistant/is-migration-assistant-right-for-you.md b/_migration-assistant/is-migration-assistant-right-for-you.md index 073c2b6cd7..77f3c6a806 100644 --- a/_migration-assistant/is-migration-assistant-right-for-you.md +++ b/_migration-assistant/is-migration-assistant-right-for-you.md @@ -36,6 +36,25 @@ There are also tools available for migrating cluster configuration, templates, a The tooling is designed to work with other cloud provider platforms, but it is not officially tested with these other platforms. If you would like to add support, please contact one of the maintainers on [GitHub](https://github.com/opensearch-project/opensearch-migrations/blob/main/MAINTAINERS.md). +### Supported AWS regions + +Migration Assistant supports the following AWS regions: + +- US East (N. Virginia) +- US East (Ohio) +- US West (Oregon) +- US West (N. 
California) +- Europe (Frankfurt) +- Europe (Ireland) +- Europe (London) +- Asia Pacific (Tokyo) +- Asia Pacific (Singapore) +- Asia Pacific (Sydney) +- AWS GovCloud (US-East)[^1] +- AWS GovCloud (US-West)[^1] + +[^1]: GovCloud does not support `reindex-from-snapshot` (RFS) shard sizes above 80GiB. Ensure your shard sizes are within this limit when planning migrations with RFS in the listed GovCloud regions. + ### Future migration paths To see the OpenSearch migrations roadmap, go to [OpenSearch Migrations - Roadmap](https://github.com/orgs/opensearch-project/projects/229/views/1). @@ -50,9 +69,9 @@ Before starting a migration, consider the scope of the components involved. The | **Index settings** | Yes | Migrate with the metadata migration tool. | | **Index mappings** | Yes | Migrate with the metadata migration tool. | | **Index templates** | Yes | Migrate with the metadata migration tool. | -| **Component templates** | Yes | Migrate with the metadata migration tool. | -| **Aliases** | Yes | Migrate with the metadata migration tool. | -| **Index State Management (ISM) policies** | Expected in 2025 | Manually migrate using an API. | +| **Component templates** | Yes | Migrate with the metadata migration tool. | +| **Aliases** | Yes | Migrate with the metadata migration tool. | +| **Index State Management (ISM) policies** | Expected in 2025 | Manually migrate using an API. | | **Elasticsearch Kibana dashboards** | Expected in 2025 | This tool is only needed when used to migrate Elasticsearch Kibana Dashboards to OpenSearch Dashboards. To start, export JSON files from Kibana and import them into OpenSearch Dashboards; before importing, use the [`dashboardsSanitizer`](https://github.com/opensearch-project/opensearch-migrations/tree/main/dashboardsSanitizer) tool on X-Pack visualizations like Canvas and Lens in Kibana Dashboards, as they may require recreation for compatibility with OpenSearch. | -| **Security constructs** | No | Configure roles and permissions based on cloud provider recommendations. For example, if using AWS, leverage AWS Identity and Access Management (IAM) for enhanced security management. | -| **Plugins** | No | Check plugin compatibility; some Elasticsearch plugins may not have direct equivalents in OpenSearch. | +| **Security constructs** | No | Configure roles and permissions based on cloud provider recommendations. For example, if using AWS, leverage AWS Identity and Access Management (IAM) for enhanced security management. | +| **Plugins** | No | Check plugin compatibility; some Elasticsearch plugins may not have direct equivalents in OpenSearch. | From b131fc917e443f92d7e190727fd03ec15abb0565 Mon Sep 17 00:00:00 2001 From: Saliha <49085460+Saliha067@users.noreply.github.com> Date: Wed, 8 Jan 2025 23:45:55 +0900 Subject: [PATCH 08/10] Update permissions.md (#9031) Signed-off-by: Saliha <49085460+Saliha067@users.noreply.github.com> --- _security/access-control/permissions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_security/access-control/permissions.md b/_security/access-control/permissions.md index 5a75a0a5a7..bacc49fe20 100644 --- a/_security/access-control/permissions.md +++ b/_security/access-control/permissions.md @@ -528,7 +528,7 @@ These permissions apply to an index or index pattern. You might want a user to h | `indices:monitor/data_stream/stats` | Permission to stream stats. | | `indices:monitor/recovery` | Permission to access recovery stats. | | `indices:monitor/segments` | Permission to access segment stats. 
| -| `indices:monitor/settings/get` | Permission to get mointor settings. | +| `indices:monitor/settings/get` | Permission to get monitor settings. | | `indices:monitor/shard_stores` | Permission to access shard store stats. | | `indices:monitor/stats` | Permission to access monitoring stats. | | `indices:monitor/upgrade` | Permission to access upgrade stats. | From e3ab7eb92236e1a0bea72edb4c639598685c0495 Mon Sep 17 00:00:00 2001 From: Shawshark Date: Wed, 8 Jan 2025 07:33:48 -0800 Subject: [PATCH 09/10] Update knn-vector-quantization.md (#8985) Signed-off-by: Shawshark --- _search-plugins/knn/knn-vector-quantization.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_search-plugins/knn/knn-vector-quantization.md b/_search-plugins/knn/knn-vector-quantization.md index a911dc91c9..86984972ee 100644 --- a/_search-plugins/knn/knn-vector-quantization.md +++ b/_search-plugins/knn/knn-vector-quantization.md @@ -359,7 +359,7 @@ PUT my-vector-index "mode": "on_disk", "compression_level": "16x", "method": { - "params": { + "parameters": { "ef_construction": 16 } } From 9439bb87d1bbabecf8f1b8a7ab1f9bda9d60b11b Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Wed, 8 Jan 2025 10:43:36 -0500 Subject: [PATCH 10/10] Update knn-vector-quantization.md (#9035) Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _search-plugins/knn/knn-vector-quantization.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_search-plugins/knn/knn-vector-quantization.md b/_search-plugins/knn/knn-vector-quantization.md index 86984972ee..2e516b0b8d 100644 --- a/_search-plugins/knn/knn-vector-quantization.md +++ b/_search-plugins/knn/knn-vector-quantization.md @@ -384,7 +384,7 @@ PUT my-vector-index "name": "hnsw", "engine": "faiss", "space_type": "l2", - "params": { + "parameters": { "m": 16, "ef_construction": 512, "encoder": {