[DOC] Character filter - Index & HTML Strip (opensearch-project#8206)
* Addition of Character filter documentation
Signed-off-by: [email protected] <[email protected]>

Signed-off-by: [email protected] <[email protected]>

* Updates for readability and flow.

Signed-off-by: [email protected] <[email protected]>

* Adding and testing examples for the HTML analyser page.

Signed-off-by: [email protected] <[email protected]>

* updates for custom analyzer

Signed-off-by: [email protected] <[email protected]>

* correcting page flow and headings

Signed-off-by: [email protected] <[email protected]>

* updating nav order

Signed-off-by: [email protected] <[email protected]>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <[email protected]>

* Doc review completed

Signed-off-by: Melissa Vagi <[email protected]>

* Doc review completed

Signed-off-by: Melissa Vagi <[email protected]>

* Update _analyzers/character-filters/index.md

Signed-off-by: Melissa Vagi <[email protected]>

* review comments addressed

Signed-off-by: [email protected] <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: leanneeliatra <[email protected]>

* apply review suggestions

Signed-off-by: [email protected] <[email protected]>

* Update _analyzers/character-filters/html-character-filter.md

Signed-off-by: Melissa Vagi <[email protected]>

* Update _analyzers/character-filters/index.md

Signed-off-by: Melissa Vagi <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Melissa Vagi <[email protected]>
Signed-off-by: leanneeliatra <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
3 people committed Oct 17, 2024
1 parent a488b45 commit 0ef58ba
Showing 2 changed files with 143 additions and 0 deletions.
124 changes: 124 additions & 0 deletions _analyzers/character-filters/html-character-filter.md
@@ -0,0 +1,124 @@
---
layout: default
title: html_strip character filter
parent: Character filters
nav_order: 100
---

# `html_strip` character filter

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text. It also decodes HTML entities, such as `&nbsp;`, into their corresponding characters, and you can configure it to preserve certain tags by using the `escaped_tags` parameter.

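## Example: HTML analyzer

The following request applies the `html_strip` character filter to text that contains `<p>` tags and HTML character entity references: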

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
}
```
{% include copy-curl.html %}

The `html_strip` character filter removes the `<p>` tags and converts the HTML character entity references into their corresponding symbols. The processed text reads as follows:

```
Commonly used calculus symbols include α, β and θ
```

## Example: Custom analyzer with lowercase filter

The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` character filter and the `lowercase` token filter:

```json
PUT /html_strip_and_lowercase_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

### Testing `html_strip_and_lowercase_analyzer`

You can run the following request to test the analyzer:

```json
GET /html_strip_and_lowercase_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
}
```
{% include copy-curl.html %}

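In the response, the HTML tags have been removed and the text has been converted to lowercase. Because the `standard` tokenizer discards punctuation, the exclamation point is not part of any token. The resulting tokens are as follows:

```
welcome to opensearch
```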
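
To apply the analyzer at index time, you can reference it in a field mapping. The following request is a minimal sketch that assumes a hypothetical `content` text field, which is not part of the preceding example:

```json
PUT /html_strip_and_lowercase_analyzer/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "html_strip_analyzer"
    }
  }
}
```
{% include copy-curl.html %}

With this mapping, text indexed into the `content` field is passed through `html_strip_analyzer` before the tokens are stored. The original document in `_source` is left unchanged.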

## Example: Custom analyzer that preserves HTML tags

The following example request creates a custom analyzer that strips HTML tags but preserves the `<b>` and `<i>` tags listed in the `escaped_tags` parameter:

```json
PUT /html_strip_preserve_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_filter"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

### Testing `html_strip_preserve_analyzer`

You can run the following request to test the analyzer:

```json
GET /html_strip_preserve_analyzer/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}
```
{% include copy-curl.html %}

In the response, the `<b>` and `<i>` tags have been retained, as specified by `escaped_tags` in the custom analyzer request, while the `<p>` tags have been removed:

```
This is a <b>bold</b> and <i>italic</i> text.
```
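
To experiment with `escaped_tags` without creating an index, you can define the character filter inline in an `_analyze` request. The following request is a sketch of that approach. As an illustrative variation, it preserves only the `<b>` tag, so the `<i>` and `<p>` tags should be stripped:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
}
```
{% include copy-curl.html %}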
19 changes: 19 additions & 0 deletions _analyzers/character-filters/index.md
@@ -0,0 +1,19 @@
---
layout: default
title: Character filters
nav_order: 90
has_children: true
has_toc: false
---

# Character filters

Character filters process text before tokenization to prepare it for further analysis.

Unlike token filters, which operate on tokens (words or terms), character filters operate on the raw input text. They are especially useful for cleaning or transforming text that contains unwanted characters, such as HTML tags or special symbols. Character filters strip or replace these elements so that the text is properly formatted for analysis.

Use cases for character filters include:

- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
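
The pattern replacement and custom mapping use cases can be sketched with the `pattern_replace` and `mapping` character filters, defined inline in an `_analyze` request. The sample text, the hyphen pattern, and the `€ => EUR` mapping are illustrative assumptions rather than recommended settings:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "-",
      "replacement": " "
    },
    {
      "type": "mapping",
      "mappings": ["€ => EUR"]
    }
  ],
  "text": "A state-of-the-art gadget costs 20 €"
}
```
{% include copy-curl.html %}

In this request, hyphens are replaced with spaces and the euro sign is replaced with the text `EUR` before the `standard` tokenizer runs.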
