From fabcb3b66b902ae15d37540b1c9343132135a8d6 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Fri, 22 Dec 2023 13:32:52 -0700
Subject: [PATCH 01/12] Add html strip processor documentation

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 86 ++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 _ingest-pipelines/processors/html-strip.md

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
new file mode 100644
index 0000000000..d33533714d
--- /dev/null
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -0,0 +1,86 @@
+---
+layout: default
+title: HTML strip
+parent: Ingest processors
+nav_order: 140
+---
+
+# JSON processor
+
+The `html_strip` processor is used to .
+
+The following is the syntax for the `html_strip` processor:
+
+```json
+
+```
+{% include copy-curl.html %}
+
+## Configuration parameters
+
+The following table lists the required and optional parameters for the `html_strip` processor.
+
+Parameter | Required/Optional | Description |
+|-----------|-----------|-----------|
+
+
+## Using the processor
+
+Follow these steps to use the processor in a pipeline.
+
+### Step 1: Create a pipeline
+
+The following query creates a pipeline, named , that uses the `html_strip` processor to :
+
+```json
+
+```
+{% include copy-curl.html %}
+
+### Step 2 (Optional): Test the pipeline
+
+It is recommended that you test your pipeline before you ingest documents.
+{: .tip}
+
+To test the pipeline, run the following query:
+
+```json
+
+```
+{% include copy-curl.html %}
+
+#### Response
+
+The following example response confirms that the pipeline is working as expected:
+
+```json
+
+```
+
+### Step 3: Ingest a document
+
+The following query ingests a document into an index named `testindex1`:
+
+```json
+
+```
+{% include copy-curl.html %}
+
+#### Response
+
+The request indexes the document into the index and will index all documents with .
+
+```json
+
+```
+
+### Step 4 (Optional): Retrieve the document
+
+To retrieve the document, run the following query:
+
+```json
+
+```
+{% include copy-curl.html %}
+
+ 
\ No newline at end of file
From 41a2bdf30b6cf15106707a2492085e0706dc6fac Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Fri, 22 Dec 2023 13:38:17 -0700
Subject: [PATCH 02/12] Add html strip processor documentation

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index d33533714d..d164bb5a83 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -5,7 +5,7 @@ parent: Ingest processors
 nav_order: 140
 ---
 
-# JSON processor
+# HTML strip processor
 
 The `html_strip` processor is used to .
 
From de535f7b418cedae3dfb48f456661a93db702baa Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Wed, 22 May 2024 15:52:15 -0600
Subject: [PATCH 03/12] Add examples

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 103 ++++++++++++++++++---
 1 file changed, 91 insertions(+), 12 deletions(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index d164bb5a83..3279c58dea 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -7,7 +7,7 @@ nav_order: 140
 
 # HTML strip processor
 
-The `html_strip` processor is used to .
+The `html_strip` processor removes HTML tags from string fields in incoming documents. The processor is useful when indexing data from web pages or other sources that may contain HTML markup. By removing the HTML tags, you can ensure that the indexed content is clean and easily searchable. HTML tags are replaced with newline characters (`\n`).
 
 The following is the syntax for the `html_strip` processor:
 
@@ -22,7 +22,14 @@ The following table lists the required and optional parameters for the `html_str
 
 Parameter | Required/Optional | Description |
 |-----------|-----------|-----------|
-
+`field` | Required | The string field from which to remove HTML tags.
+`target_field` | Optional | The field to assign the cleaned value to. If not specified, field is updated in-place.
+`ignore_missing` | Optional | Default is `false`. If `true`, the processor quietly exits without modifying the document when field does not exist.
+`description` | Optional | Description of the processor's purpose or configuration.
+`if` | Optional | Conditionally execute the processor.
+`ignore_failure` | Optional | Ignore failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`on_failure` | Optional | Handle failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`tag` | Optional | Identifier for the processor. Useful for debugging and metrics.
 
 ## Using the processor
 
@@ -30,10 +37,21 @@ Follow these steps to use the processor in a pipeline.
 
 ### Step 1: Create a pipeline
 
-The following query creates a pipeline, named , that uses the `html_strip` processor to :
+The following query creates a pipeline named `strip-html-pipeline` that uses the `html_strip` processor to remove HTML tags from the description field and store the processed value in a new field named `cleaned_description`:
 
 ```json
-
+PUT _ingest/pipeline/strip-html-pipeline
+{
+  "description": "A pipeline to strip HTML from description field",
+  "processors": [
+    {
+      "html_strip": {
+        "field": "description",
+        "target_field": "cleaned_description"
+      }
+    }
+  ]
+}
 ```
 {% include copy-curl.html %}
 
@@ -45,7 +63,16 @@ It is recommended that you test your pipeline before you ingest documents.
 To test the pipeline, run the following query:
 
 ```json
-
+POST _ingest/pipeline/strip-html-pipeline/_simulate
+{
+  "docs": [
+    {
+      "_source": {
+        "description": "This is a test description with some HTML tags."
+      }
+    }
+  ]
+}
 ```
 {% include copy-curl.html %}
 
@@ -54,33 +81,85 @@ To test the pipeline, run the following query:
 The following example response confirms that the pipeline is working as expected:
 
 ```json
-
+{
+  "docs": [
+    {
+      "doc": {
+        "_index": "_index",
+        "_id": "_id",
+        "_source": {
+          "description": "This is a test description with some HTML tags.",
+          "cleaned_description": "This is a test description with some HTML tags."
+        },
+        "_ingest": {
+          "timestamp": "2024-05-22T21:46:11.227974965Z"
+        }
+      }
+    }
+  ]
+}
 ```
+{% include copy-curl.html %}
 
 ### Step 3: Ingest a document
 
-The following query ingests a document into an index named `testindex1`:
+The following query ingests a document into an index named `products`:
 
 ```json
-
+PUT products/_doc/1?pipeline=strip-html-pipeline
+{
+  "name": "Product 1",
+  "description": "This is a test product with some HTML tags."
+}
 ```
 {% include copy-curl.html %}
 
 #### Response
 
-The request indexes the document into the index and will index all documents with .
+The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags, while storing the cleaned version in the `cleaned_description` field.
 
 ```json
-
+{
+  "_index": "products",
+  "_id": "1",
+  "_version": 1,
+  "result": "created",
+  "_shards": {
+    "total": 2,
+    "successful": 1,
+    "failed": 0
+  },
+  "_seq_no": 0,
+  "_primary_term": 1
+}
 ```
+{% include copy-curl.html %}
 
 ### Step 4 (Optional): Retrieve the document
 
 To retrieve the document, run the following query:
 
 ```json
-
+GET products/_doc/1
 ```
 {% include copy-curl.html %}
 
- 
\ No newline at end of file
+#### Response
+
+The response includes both the original `description` field and the `cleaned_description` field with HTML tags removed.
+
+```json
+{
+  "_index": "products",
+  "_id": "1",
+  "_version": 1,
+  "_seq_no": 0,
+  "_primary_term": 1,
+  "found": true,
+  "_source": {
+    "cleaned_description": "This is a test product with some HTML tags.",
+    "name": "Product 1",
+    "description": "This is a test product with some HTML tags."
+  }
+}
+```
\ No newline at end of file
From c76e12fa5cedbd8cc2d7101c7de8e80b9b40ca8a Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Wed, 5 Jun 2024 14:48:01 -0600
Subject: [PATCH 04/12] Copy edits

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index 3279c58dea..7a9c5665dc 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -12,7 +12,11 @@ The `html_strip` processor removes HTML tags from string fields in incoming docu
 The following is the syntax for the `html_strip` processor:
 
 ```json
-
+{
+  "html_strip": {
+    "field": "webpage"
+  }
+}
 ```
 {% include copy-curl.html %}
 
@@ -24,12 +28,12 @@ Parameter | Required/Optional | Description |
 |-----------|-----------|-----------|
 `field` | Required | The string field from which to remove HTML tags.
 `target_field` | Optional | The field to assign the cleaned value to. If not specified, field is updated in-place.
-`ignore_missing` | Optional | Default is `false`. If `true`, the processor quietly exits without modifying the document when field does not exist.
-`description` | Optional | Description of the processor's purpose or configuration.
-`if` | Optional | Conditionally execute the processor.
-`ignore_failure` | Optional | Ignore failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
-`on_failure` | Optional | Handle failures for the processor. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
-`tag` | Optional | Identifier for the processor. Useful for debugging and metrics.
+`ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. Default is `false`.
+`description` | Optional | A description of the processor's purpose or configuration.
+`if` | Optional | Specifies to conditionally execute the processor.
+`ignore_failure` | Optional | Specifies to ignore processor failures. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`on_failure` | Optional | Specifies a list of processors to run if the processor fails during execution. These processors are executed in the order they are specified. See [Handling pipeline failures]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/).
+`tag` | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type.
 
 ## Using the processor
 
From 5803bd33fc4c8166681f206dd2859ef05cc3e3c3 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 08:45:20 -0600
Subject: [PATCH 05/12] Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower
Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index 7a9c5665dc..b0b6d03f72 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -7,7 +7,7 @@ nav_order: 140
 
 # HTML strip processor
 
-The `html_strip` processor removes HTML tags from string fields in incoming documents. The processor is useful when indexing data from web pages or other sources that may contain HTML markup. By removing the HTML tags, you can ensure that the indexed content is clean and easily searchable. HTML tags are replaced with newline characters (`\n`).
+The `html_strip` processor removes HTML tags from string fields in incoming documents. This processor is useful when indexing data from webpages or other sources that may contain HTML markup. By removing the HTML tags, you can ensure that the indexed content is clean and easily searchable. HTML tags are replaced with newline characters (`\n`).
 
 The following is the syntax for the `html_strip` processor:
 
From 764c597d79be537bb493a7e49244bcb1ccf9ce64 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:15:54 -0600
Subject: [PATCH 06/12] Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower
Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index b0b6d03f72..13bb085c04 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -27,7 +27,7 @@ The following table lists the required and optional parameters for the `html_str
 Parameter | Required/Optional | Description |
 |-----------|-----------|-----------|
 `field` | Required | The string field from which to remove HTML tags.
-`target_field` | Optional | The field to assign the cleaned value to. If not specified, field is updated in-place.
+`target_field` | Optional | The field to assign the cleaned value to. If not specified, then the field is updated in-place.
 `ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. Default is `false`.
 `description` | Optional | A description of the processor's purpose or configuration.
 `if` | Optional | Specifies to conditionally execute the processor.
From 0d6356aeb84f4a40ac32c87cd1ec24bd23557d37 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:16:04 -0600
Subject: [PATCH 07/12] Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower
Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index 13bb085c04..4e14858339 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -120,7 +120,7 @@ PUT products/_doc/1?pipeline=strip-html-pipeline
 
 #### Response
 
-The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags, while storing the cleaned version in the `cleaned_description` field.
+The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags while storing the clean version in the `cleaned_description` field:
 
 ```json
 {
From 4d06d4db36a354c685deb3917615eef7dde5ca4a Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:16:11 -0600
Subject: [PATCH 08/12] Update _ingest-pipelines/processors/html-strip.md

Co-authored-by: Nathan Bower
Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index 4e14858339..c21e432a56 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -150,7 +150,7 @@ GET products/_doc/1
 
 #### Response
 
-The response includes both the original `description` field and the `cleaned_description` field with HTML tags removed.
+The response includes both the original `description` field and the `cleaned_description` field with HTML tags removed:
 
 ```json
 {
From a2ab21e5569aa5d8fd58b5c65b2f54e97ed2fb96 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:16:39 -0600
Subject: [PATCH 09/12] Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index c21e432a56..18f03f7161 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -7,7 +7,7 @@ nav_order: 140
 
 # HTML strip processor
 
-The `html_strip` processor removes HTML tags from string fields in incoming documents. This processor is useful when indexing data from webpages or other sources that may contain HTML markup. By removing the HTML tags, you can ensure that the indexed content is clean and easily searchable. HTML tags are replaced with newline characters (`\n`).
+The `html_strip` processor removes HTML tags from string fields in incoming documents. This processor is useful when indexing data from webpages or other sources that may contain HTML markup. HTML tags are replaced with newline characters (`\n`).
 
 The following is the syntax for the `html_strip` processor:
 
From 2f7f17c29a7096ead5e81eb44bbb3128d1434b11 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:19:39 -0600
Subject: [PATCH 10/12] Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index 18f03f7161..c6cbc2cd1c 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -120,7 +120,7 @@ PUT products/_doc/1?pipeline=strip-html-pipeline
 
 #### Response
 
-The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags while storing the clean version in the `cleaned_description` field:
+The response shows that the request has indexed the document into the index `products` and will index all documents with the `description` field containing HTML tags while storing the plain text version in the `cleaned_description` field:
 
 ```json
 {
From f5a3d62b1e87fd3a153209500d6450c735082466 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:27:59 -0600
Subject: [PATCH 11/12] Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index c6cbc2cd1c..6b814e7fe3 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -27,7 +27,7 @@ The following table lists the required and optional parameters for the `html_str
 Parameter | Required/Optional | Description |
 |-----------|-----------|-----------|
 `field` | Required | The string field from which to remove HTML tags.
-`target_field` | Optional | The field to assign the cleaned value to. If not specified, then the field is updated in-place.
+`target_field` | Optional | The field to receive the plain text version after stripping HTML tags. If not specified, then the field is updated in-place.
 `ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. Default is `false`.
 `description` | Optional | A description of the processor's purpose or configuration.
 `if` | Optional | Specifies to conditionally execute the processor.
From 8fd346de9b4a26308dea8a42dfc95bbb4fa582a5 Mon Sep 17 00:00:00 2001
From: Melissa Vagi
Date: Thu, 6 Jun 2024 09:29:58 -0600
Subject: [PATCH 12/12] Update _ingest-pipelines/processors/html-strip.md

Signed-off-by: Melissa Vagi
---
 _ingest-pipelines/processors/html-strip.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_ingest-pipelines/processors/html-strip.md b/_ingest-pipelines/processors/html-strip.md
index 6b814e7fe3..ac33c45eae 100644
--- a/_ingest-pipelines/processors/html-strip.md
+++ b/_ingest-pipelines/processors/html-strip.md
@@ -27,7 +27,7 @@ The following table lists the required and optional parameters for the `html_str
 Parameter | Required/Optional | Description |
 |-----------|-----------|-----------|
 `field` | Required | The string field from which to remove HTML tags.
-`target_field` | Optional | The field to receive the plain text version after stripping HTML tags. If not specified, then the field is updated in-place.
+`target_field` | Optional | The field that receives the plain text version after stripping HTML tags. If not specified, then the field is updated in-place.
 `ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. Default is `false`.
 `description` | Optional | A description of the processor's purpose or configuration.
 `if` | Optional | Specifies to conditionally execute the processor.
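
The parameter table in the final revision lists several optional settings that none of the patch examples exercise. As a minimal sketch only, the following request shows how `ignore_missing`, `tag`, and `if` might be combined with `field` and `target_field`. The pipeline name `strip-html-optional`, the tag value, and the Painless condition on a `name` field are illustrative assumptions and are not part of this patch series:

```json
PUT _ingest/pipeline/strip-html-optional
{
  "description": "Hypothetical example: html_strip with optional parameters",
  "processors": [
    {
      "html_strip": {
        "field": "description",
        "target_field": "cleaned_description",
        "ignore_missing": true,
        "tag": "strip-description-html",
        "if": "ctx.name != null"
      }
    }
  ]
}
```

Under these assumptions, documents without a `description` field would pass through unchanged instead of failing, and the processor would run only for documents that have a `name` field.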