From 0ef58bae229aea175d9413f175e03944a4e78a31 Mon Sep 17 00:00:00 2001
From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com>
Date: Wed, 16 Oct 2024 00:57:43 +0100
Subject: [PATCH] [DOC] Character filter - Index & HTML Strip (#8206)

* Addition of Character filter documentation

Signed-off-by: leanne.laceybyrne@eliatra.com
Signed-off-by: leanne.laceybyrne@eliatra.com

* Updates for readability and flow.

Signed-off-by: leanne.laceybyrne@eliatra.com

* Addding and testing examples for the HTML analyser page.

Signed-off-by: leanne.laceybyrne@eliatra.com

* updates for custom analyzer

Signed-off-by: leanne.laceybyrne@eliatra.com

* correcting page flow and headings

Signed-off-by: leanne.laceybyrne@eliatra.com

* updating nav order

Signed-off-by: leanne.laceybyrne@eliatra.com

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi

* Doc review completed

Signed-off-by: Melissa Vagi

* Doc review completed

Signed-off-by: Melissa Vagi

* Update _analyzers/character-filters/index.md

Signed-off-by: Melissa Vagi

* review comments addressed

Signed-off-by: leanne.laceybyrne@eliatra.com

* Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com>

* apply review suggestions

Signed-off-by: leanne.laceybyrne@eliatra.com

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi

* Update _analyzers/character-filters/index.md

Signed-off-by: Melissa Vagi

---------

Signed-off-by: leanne.laceybyrne@eliatra.com
Signed-off-by: Melissa Vagi
Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com>
Co-authored-by: Melissa Vagi
Co-authored-by: Nathan Bower
---
 .../html-character-filter.md          | 124 ++++++++++++++++++
 _analyzers/character-filters/index.md |  19 +++
 2 files changed, 143 insertions(+)
 create mode 100644 _analyzers/character-filters/html-character-filter.md
 create mode 100644 _analyzers/character-filters/index.md

diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md
new file mode 100644
index 0000000000..9fb98d9744
--- /dev/null
+++ b/_analyzers/character-filters/html-character-filter.md
@@ -0,0 +1,124 @@
+---
+layout: default
+title: html_strip character filter
+parent: Character filters
+nav_order: 100
+---
+
+# `html_strip` character filter
+
+The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as `&nbsp;`, into spaces.
+
+## Example: HTML analyzer
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    "html_strip"
+  ],
+  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta;</p>"
+}
+```
+{% include copy-curl.html %}
+
+Using the HTML analyzer, you can convert the HTML character entity references into their corresponding symbols. The processed text would read as follows:
+
+```
+Commonly used calculus symbols include α, β and θ
+```
+
+## Example: Custom analyzer with lowercase filter
+
+The following example query creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` character filter and `lowercase` token filter:
+
+```json
+PUT /html_strip_and_lowercase_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip"
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "standard",
+          "filter": ["lowercase"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Testing `html_strip_and_lowercase_analyzer`
+
+You can run the following request to test the analyzer:
+
+```json
+GET /html_strip_and_lowercase_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
+}
+```
+{% include copy-curl.html %}
+
+In the response, the HTML tags have been removed and the plain text has been converted to lowercase:
+
+```
+welcome to opensearch!
+```
+
+## Example: Custom analyzer that preserves HTML tags
+
+The following example request creates a custom analyzer that preserves HTML tags:
+
+```json
+PUT /html_strip_preserve_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip",
+          "escaped_tags": ["b", "i"]
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "keyword"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Testing `html_strip_preserve_analyzer`
+
+You can run the following request to test the analyzer:
+
+```json
+GET /html_strip_preserve_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
+}
+```
+{% include copy-curl.html %}
+
+In the response, the `italic` and `bold` tags have been retained, as specified in the custom analyzer request:
+
+```
+This is a <b>bold</b> and <i>italic</i> text.
+```
diff --git a/_analyzers/character-filters/index.md b/_analyzers/character-filters/index.md
new file mode 100644
index 0000000000..0e2ce01b8c
--- /dev/null
+++ b/_analyzers/character-filters/index.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: Character filters
+nav_order: 90
+has_children: true
+has_toc: false
+---
+
+# Character filters
+
+Character filters process text before tokenization to prepare it for further analysis.
+
+Unlike token filters, which operate on tokens (words or terms), character filters process the raw input text before tokenization. They are especially useful for cleaning or transforming structured text containing unwanted characters, such as HTML tags or special symbols. Character filters help to strip or replace these elements so that text is properly formatted for analysis.
+
+Use cases for character filters include:
+
+- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
+- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
+- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.