From 0ef58bae229aea175d9413f175e03944a4e78a31 Mon Sep 17 00:00:00 2001
From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com>
Date: Wed, 16 Oct 2024 00:57:43 +0100
Subject: [PATCH] [DOC] Character filter - Index & HTML Strip (#8206)

* Addition of Character filter documentation
Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* Updates for readability and flow.

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* Adding and testing examples for the HTML analyzer page.

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* updates for custom analyzer

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* correcting page flow and headings

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* updating nav order

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Doc review completed

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Doc review completed

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _analyzers/character-filters/index.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* review comments addressed

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com>

* apply review suggestions

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Update _analyzers/character-filters/index.md

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

---------

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
---
 .../html-character-filter.md                  | 124 ++++++++++++++++++
 _analyzers/character-filters/index.md         |  19 +++
 2 files changed, 143 insertions(+)
 create mode 100644 _analyzers/character-filters/html-character-filter.md
 create mode 100644 _analyzers/character-filters/index.md
diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md
new file mode 100644
index 0000000000..9fb98d9744
--- /dev/null
+++ b/_analyzers/character-filters/html-character-filter.md
@@ -0,0 +1,124 @@
+---
+layout: default
+title: html_strip character filter
+parent: Character filters
+nav_order: 100
+---
+
+# `html_strip` character filter
+
+The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and outputs plain text. It also decodes HTML entities, for example, converting `&nbsp;` into a space, and it can be configured to preserve specific tags by using the `escaped_tags` parameter.
+
+## Example: HTML analyzer
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    "html_strip"
+  ],
+  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
+}
+```
+{% include copy-curl.html %}
+
+The `html_strip` character filter removes the `<p>` tags and converts the HTML character entity references into their corresponding symbols. The processed text reads as follows:
+
+```
+Commonly used calculus symbols include α, β and θ 
+```
+
+## Example: Custom analyzer with lowercase filter
+
+The following example request creates an index with a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` character filter and the `lowercase` token filter:
+
+```json
+PUT /html_strip_and_lowercase_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip"
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "standard",
+          "filter": ["lowercase"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Testing `html_strip_and_lowercase_analyzer`
+
+You can run the following request to test the analyzer:
+
+```json
+GET /html_strip_and_lowercase_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
+}
+```
+{% include copy-curl.html %}
+
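+In the response, the HTML tags have been removed and the text has been split into lowercase tokens. The `standard` tokenizer also discards the exclamation point, so the resulting tokens are as follows:
+
+```
+welcome to opensearch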
+```
+
+## Example: Custom analyzer that preserves HTML tags
+
+The following example request creates an index with a custom analyzer that preserves the `<b>` and `<i>` tags by listing them in the `escaped_tags` parameter:
+
+```json
+PUT /html_strip_preserve_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip",
+          "escaped_tags": ["b", "i"]
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "keyword"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Testing `html_strip_preserve_analyzer`
+
+You can run the following request to test the analyzer:
+
+```json
+GET /html_strip_preserve_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
+}
+```
+{% include copy-curl.html %}
+
+In the response, the `<b>` and `<i>` tags have been retained, as specified in the `escaped_tags` parameter, while the `<p>` tags have been removed:
+
+```
+This is a <b>bold</b> and <i>italic</i> text.
+```
diff --git a/_analyzers/character-filters/index.md b/_analyzers/character-filters/index.md
new file mode 100644
index 0000000000..0e2ce01b8c
--- /dev/null
+++ b/_analyzers/character-filters/index.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: Character filters
+nav_order: 90
+has_children: true
+has_toc: false
+---
+
+# Character filters
+
+Character filters process text before tokenization to prepare it for further analysis.
+
+Unlike token filters, which operate on individual tokens (words or terms), character filters operate on the raw input text as a stream of characters. They are especially useful for cleaning or transforming text that contains unwanted characters or markup, such as HTML tags or special symbols, by stripping or replacing these elements so that the text is properly formatted for analysis.
+
+Use cases for character filters include:
+
+- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
+- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
+- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.