[DOC] Character filter - Index & HTML Strip (opensearch-project#8206)

* Addition of Character filter documentation Signed-off-by: [email protected] <[email protected]> Signed-off-by: [email protected] <[email protected]>
* Updates for readability and flow. Signed-off-by: [email protected] <[email protected]>
* Adding and testing examples for the HTML analyser page. Signed-off-by: [email protected] <[email protected]>
* updates for custom analyzer Signed-off-by: [email protected] <[email protected]>
* correcting page flow and headings Signed-off-by: [email protected] <[email protected]>
* updating nav order Signed-off-by: [email protected] <[email protected]>
* Update _analyzers/character-filters/html-character-filter.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/character-filters/html-character-filter.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/character-filters/html-character-filter.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/character-filters/html-character-filter.md Signed-off-by: Melissa Vagi <[email protected]>
* Doc review completed Signed-off-by: Melissa Vagi <[email protected]>
* Doc review completed Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/character-filters/index.md Signed-off-by: Melissa Vagi <[email protected]>
* review comments addressed Signed-off-by: [email protected] <[email protected]>
* Apply suggestions from code review Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: leanneeliatra <[email protected]>
* apply review suggestions Signed-off-by: [email protected] <[email protected]>
* Update _analyzers/character-filters/html-character-filter.md Signed-off-by: Melissa Vagi <[email protected]>
* Update _analyzers/character-filters/index.md Signed-off-by: Melissa Vagi <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: leanneeliatra <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
leanneeliatra · Oct 17, 2024 · 0ef58ba
1 parent a488b45
commit 0ef58ba
Showing 2 changed files with 143 additions and 0 deletions.
diff --git a/_analyzers/character-filters/html-character-filter.md b/_analyzers/character-filters/html-character-filter.md
@@ -0,0 +1,124 @@
+---
+layout: default
+title: html_strip character filter
+parent: Character filters
+nav_order: 100
+---
+
+# `html_strip` character filter
+
+The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and returns plain text. It also decodes HTML character entity references, for example, converting `&nbsp;` into a space. You can configure the filter to preserve specific tags by listing them in the `escaped_tags` parameter.
+
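+## Example: HTML character filter
+
+The following request applies the `html_strip` character filter to text that contains HTML tags and character entity references: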
+
+```json
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    "html_strip"
+  ],
+  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
+}
+```
+{% include copy-curl.html %}
+
+The `html_strip` character filter removes the `<p>` tags and converts the HTML character entity references into their corresponding symbols. The processed text reads as follows:
+
+```
+Commonly used calculus symbols include α, β and θ 
+```
+
+## Example: Custom analyzer with lowercase filter
+
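+The following example request creates an index named `html_strip_and_lowercase_analyzer` with a custom analyzer, `html_strip_analyzer`, that strips HTML tags and converts the plain text to lowercase by combining the `html_strip` character filter with the `lowercase` token filter: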
+
+```json
+PUT /html_strip_and_lowercase_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip"
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "standard",
+          "filter": ["lowercase"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Testing `html_strip_and_lowercase_analyzer`
+
+You can run the following request to test the analyzer:
+
+```json
+GET /html_strip_and_lowercase_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
+}
+```
+{% include copy-curl.html %}
+
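+In the response, the HTML tags have been removed and the plain text has been converted to lowercase. Because the analyzer uses the `standard` tokenizer, the text is split into separate tokens and the exclamation point is dropped. The response contains the following tokens:
+
+```
+welcome to opensearch
+```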
+
+## Example: Custom analyzer that preserves specific HTML tags
+
+The following example request creates a custom analyzer that strips HTML tags but preserves the `<b>` and `<i>` tags listed in the `escaped_tags` parameter:
+
+```json
+PUT /html_strip_preserve_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip",
+          "escaped_tags": ["b", "i"]
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "keyword"
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+### Testing `html_strip_preserve_analyzer`
+
+You can run the following request to test the analyzer:
+
+```json
+GET /html_strip_preserve_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
+}
+```
+{% include copy-curl.html %}
+
+In the response, the `<b>` and `<i>` tags have been retained, as specified by `escaped_tags` in the custom analyzer request, while the `<p>` tags have been removed:
+
+```
+This is a <b>bold</b> and <i>italic</i> text.
+```
diff --git a/_analyzers/character-filters/index.md b/_analyzers/character-filters/index.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: Character filters
+nav_order: 90
+has_children: true
+has_toc: false
+---
+
+# Character filters
+
+Character filters process text before tokenization to prepare it for further analysis.
+
+Unlike token filters, which operate on tokens (words or terms), character filters operate on the raw input text. They are especially useful for cleaning or transforming text that contains markup or unwanted characters, such as HTML tags or special symbols. Character filters strip or replace these elements so that the text is properly formatted for analysis.
+
+Use cases for character filters include:
+
+- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
+- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
+- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
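
As a rough illustration of the pattern replacement and custom mapping use cases listed above, the following request is a sketch that defines two character filters inline in the `_analyze` API. It is not part of this commit. It assumes the built-in `pattern_replace` and `mapping` character filter types, and the sample text and the `€ => EUR` mapping are made up for the example:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "-",
      "replacement": " "
    },
    {
      "type": "mapping",
      "mappings": ["€ => EUR"]
    }
  ],
  "text": "state-of-the-art design, price in €"
}
```

Character filters run in the order listed, so here the hyphens are intended to be replaced with spaces before the currency symbol is mapped to its textual equivalent. The `keyword` tokenizer is used only so that the filtered text comes back as a single token, which makes the effect of the character filters easier to inspect.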