Commit

[elasticsearch] Extension of the Elasticsearch integration with datastream-centric stats (#11656)

3kt authored Jan 7, 2025
1 parent d93bf87 commit ad70926
Showing 13 changed files with 7,203 additions and 2 deletions.
36 changes: 36 additions & 0 deletions packages/elasticsearch/_dev/build/docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -150,3 +150,39 @@ information about all shards.
{{event "shard"}}

{{fields "shard"}}

### Indices and data streams usage analysis

_Technical preview: please report any issues [here](https://github.com/elastic/integrations/issues), specifying the "elasticsearch" integration._

From version 8.17.1 of the integration onward (and for data collected with that version or later), the integration also installs a transform named `logs-elasticsearch.index_pivot-default-{VERSION}`. This transform **is not started by default** (see Stack Management > Transforms); once started, it will:

* Read data from the `index` dataset produced by this same integration.
* Aggregate the index-level stats into data-stream-centric insights, such as query count, query time, and overall data volume.
* Process the aggregated data through an additional ingest pipeline installed by the integration (`{VERSION}-monitoring_indices`) before shipping it to the `monitoring-indices` index.

You can then visualize the resulting data in the `[Elasticsearch] Indices & data streams usage` dashboard.

![Indices & data streams usage](../img/indices_datastream_view.png)

Apart from some high-level statistics, such as total query count, total query time and total addressable data, the dashboard surfaces usage information centered on two dimensions:

* The [data tier](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html).
* The data stream (see note below for details about how this is computed).

#### Tier usage

As data ages, it typically loses relative importance and is moved to less performant, more cost-effective hardware; query count and query time should diminish proportionally. Several visualizations in the dashboard let you verify this assumption against your own data and ensure your ILM policy (and therefore data tier transitions) is aligned with how the data is actually used.

#### Indices and data streams usage

Other visualizations in the dashboard let you compare the relative footprint of each data stream from a storage, querying, and indexing perspective. This can help you identify anomalies stemming from faulty configuration or inefficient usage.

Both approaches can be used in conjunction, allowing you to fine-tune ILM on a per-data-stream basis (if required) to closely match actual usage patterns.

⚠️ Important notes:

* The transform processes all compatible historical data, which has two implications: first, data collected before 8.17.1 will not be picked up by the transform; second, it may take some time for recent ("live") data to become available, as the transform works its way through all existing documents. You can modify the transform as needed.
* The target index `monitoring-indices` is not managed by ILM. If your setup has a high index count or long retention, you may need to tune the transform or [activate ILM on the target index](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#manage-time-series-data-without-data-streams). In our testing on a cluster with 5000 indices, about 1GB of primary data was generated per week (your mileage may vary).
* Data stream identification is based on the following grok pattern: `^(?:partial-)?(?:restored-)?(?:shrink-.{4}-)?(?:\\.ds-)?(?<elasticsearch.index.datastream>[a-z_0-9\\-\\.]+?)(-(?:\\d{4}\\.\\d{2}(\\.\\d{2})?))?(?:-\\d+)?$`. This should cover all out-of-the-box names; if you use non-standard names or want to aggregate data differently, you can adjust the pattern in the `{VERSION}-monitoring_indices` ingest pipeline (modifying a copy is advised).
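As an illustration, the pattern can be exercised outside Elasticsearch. Below is a minimal Python sketch: grok's `(?<name>)` syntax becomes `(?P<name>)` in Python, and the dotted field name is shortened to `datastream`, since Python group names cannot contain dots.

```python
import re

# Python adaptation of the integration's grok pattern for extracting the
# data stream name from a (backing) index name.
DATASTREAM_PATTERN = re.compile(
    r"^(?:partial-)?(?:restored-)?(?:shrink-.{4}-)?(?:\.ds-)?"
    r"(?P<datastream>[a-z_0-9\-\.]+?)"
    r"(-(?:\d{4}\.\d{2}(\.\d{2})?))?(?:-\d+)?$"
)

def extract_datastream(index_name):
    """Return the data stream name embedded in an index name, or None."""
    m = DATASTREAM_PATTERN.match(index_name)
    return m.group("datastream") if m else None

# Backing indices of data streams, restored snapshots, and plain indices:
print(extract_datastream(".ds-logs-elasticsearch.index-default-2025.01.07-000001"))
print(extract_datastream("restored-.ds-metrics-system.cpu-default-2024.12.01-000042"))
print(extract_datastream("my-plain-index-000003"))
```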

5 changes: 5 additions & 0 deletions packages/elasticsearch/changelog.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,9 @@
# newer versions go on top
- version: "1.16.0"
changes:
- description: Add transform job & dashboard for datastream metrics
type: enhancement
link: https://github.com/elastic/integrations/pull/11656
- version: "1.15.3"
changes:
- description: Make elasticsearch.node.name a TSDS dimension to prevent document collisions.
Expand Down
6 changes: 6 additions & 0 deletions packages/elasticsearch/data_stream/index/fields/fields.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,12 @@
type: keyword
- name: status
type: keyword
- name: tier_preference
type: keyword
- name: creation_date
type: date
- name: version
type: keyword
- name: name
type: keyword
dimension: true
Expand Down
5 changes: 4 additions & 1 deletion packages/elasticsearch/data_stream/index/sample_event.json
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@
"index": {
"hidden": true,
"name": ".ml-state-000001",
"tier_preference": "data_content",
"creation_date": 1731657995821,
"version": "8503000",
"primaries": {
"docs": {
"count": 0
Expand Down Expand Up @@ -141,4 +144,4 @@
"address": "http://elastic-package-service-elasticsearch-1:9200",
"type": "elasticsearch"
}
}
}
42 changes: 42 additions & 0 deletions packages/elasticsearch/docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -844,6 +844,9 @@ An example event for `index` looks as following:
"index": {
"hidden": true,
"name": ".ml-state-000001",
"tier_preference": "data_content",
"creation_date": 1731657995821,
"version": "8503000",
"primaries": {
"docs": {
"count": 0
Expand Down Expand Up @@ -974,6 +977,7 @@ An example event for `index` looks as following:
| elasticsearch.cluster.id | Elasticsearch cluster id. | keyword | |
| elasticsearch.cluster.name | Elasticsearch cluster name. | keyword | |
| elasticsearch.cluster.state.id | Elasticsearch state id. | keyword | |
| elasticsearch.index.creation_date | | date | |
| elasticsearch.index.hidden | | boolean | |
| elasticsearch.index.name | Index name. | keyword | |
| elasticsearch.index.primaries.docs.count | | long | gauge |
Expand Down Expand Up @@ -1009,6 +1013,7 @@ An example event for `index` looks as following:
| elasticsearch.index.shards.primaries | | long | |
| elasticsearch.index.shards.total | | long | |
| elasticsearch.index.status | | keyword | |
| elasticsearch.index.tier_preference | | keyword | |
| elasticsearch.index.total.bulk.avg_size_in_bytes | | long | gauge |
| elasticsearch.index.total.bulk.avg_time_in_millis | | long | gauge |
| elasticsearch.index.total.bulk.total_operations | | long | counter |
Expand Down Expand Up @@ -1049,6 +1054,7 @@ An example event for `index` looks as following:
| elasticsearch.index.total.store.size.bytes | | long | gauge |
| elasticsearch.index.total.store.size_in_bytes | Total size of the index in bytes. | long | gauge |
| elasticsearch.index.uuid | | keyword | |
| elasticsearch.index.version | | keyword | |
| elasticsearch.node.id | Node ID | keyword | |
| elasticsearch.node.master | Is the node the master node? | boolean | |
| elasticsearch.node.mlockall | Is mlockall enabled on the node? | boolean | |
Expand Down Expand Up @@ -2648,3 +2654,39 @@ An example event for `shard` looks as following:
| source_node.uuid | | alias |
| timestamp | | alias |


### Indices and data streams usage analysis

_Technical preview: please report any issues [here](https://github.com/elastic/integrations/issues), specifying the "elasticsearch" integration._

From version 8.17.1 of the integration onward (and for data collected with that version or later), the integration also installs a transform named `logs-elasticsearch.index_pivot-default-{VERSION}`. This transform **is not started by default** (see Stack Management > Transforms); once started, it will:

* Read data from the `index` dataset produced by this same integration.
* Aggregate the index-level stats into data-stream-centric insights, such as query count, query time, and overall data volume.
* Process the aggregated data through an additional ingest pipeline installed by the integration (`{VERSION}-monitoring_indices`) before shipping it to the `monitoring-indices` index.

You can then visualize the resulting data in the `[Elasticsearch] Indices & data streams usage` dashboard.

![Indices & data streams usage](../img/indices_datastream_view.png)

Apart from some high-level statistics, such as total query count, total query time and total addressable data, the dashboard surfaces usage information centered on two dimensions:

* The [data tier](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html).
* The data stream (see note below for details about how this is computed).

#### Tier usage

As data ages, it typically loses relative importance and is moved to less performant, more cost-effective hardware; query count and query time should diminish proportionally. Several visualizations in the dashboard let you verify this assumption against your own data and ensure your ILM policy (and therefore data tier transitions) is aligned with how the data is actually used.

#### Indices and data streams usage

Other visualizations in the dashboard let you compare the relative footprint of each data stream from a storage, querying, and indexing perspective. This can help you identify anomalies stemming from faulty configuration or inefficient usage.

Both approaches can be used in conjunction, allowing you to fine-tune ILM on a per-data-stream basis (if required) to closely match actual usage patterns.

⚠️ Important notes:

* The transform processes all compatible historical data, which has two implications: first, data collected before 8.17.1 will not be picked up by the transform; second, it may take some time for recent ("live") data to become available, as the transform works its way through all existing documents. You can modify the transform as needed.
* The target index `monitoring-indices` is not managed by ILM. If your setup has a high index count or long retention, you may need to tune the transform or [activate ILM on the target index](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#manage-time-series-data-without-data-streams). In our testing on a cluster with 5000 indices, about 1GB of primary data was generated per week (your mileage may vary).
* Data stream identification is based on the following grok pattern: `^(?:partial-)?(?:restored-)?(?:shrink-.{4}-)?(?:\\.ds-)?(?<elasticsearch.index.datastream>[a-z_0-9\\-\\.]+?)(-(?:\\d{4}\\.\\d{2}(\\.\\d{2})?))?(?:-\\d+)?$`. This should cover all out-of-the-box names; if you use non-standard names or want to aggregate data differently, you can adjust the pattern in the `{VERSION}-monitoring_indices` ingest pipeline (modifying a copy is advised).
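The data tier dimension used throughout the dashboard is derived by the `{VERSION}-monitoring_indices` pipeline from each index's `tier_preference` setting. A minimal Python sketch of that mapping (a hypothetical helper mirroring the precedence of the pipeline's Painless script: frozen, then cold, then warm, then hot/content):

```python
def index_tier(tier_preference):
    """Map an index's tier_preference string (e.g. "data_cold,data_warm,data_hot")
    onto the tier buckets used by the dashboard. A missing or unrecognized
    preference falls back to "unknown", as in the pipeline."""
    if not tier_preference:
        return "unknown"
    if "data_frozen" in tier_preference:
        return "frozen"
    if "data_cold" in tier_preference:
        return "cold"
    if "data_warm" in tier_preference:
        return "warm"
    if "data_hot" in tier_preference or "data_content" in tier_preference:
        return "hot/content"
    return "unknown"

print(index_tier("data_content"))                  # hot/content
print(index_tier("data_cold,data_warm,data_hot"))  # cold
```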

Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
---
processors:
- set:
field: event.ingested
tag: set_event_ingested
value: "{{_ingest.timestamp}}"
- grok:
field: elasticsearch.index.name
tag: grok_parse_index_name
patterns:
- '^(?:partial-)?(?:restored-)?(?:shrink-.{4}-)?(?:\.ds-)?(?<elasticsearch.index.datastream>[a-z_0-9\-\.]+?)(-(?:\d{4}\.\d{2}(\.\d{2})?))?(?:-\d+)?$'
ignore_failure: true
- script:
source: |
def preference = ctx.end['elasticsearch.index.tier_preference'];
if (preference.contains("data_frozen")) {
ctx.elasticsearch.index.tier = "frozen";
} else if (preference.contains("data_cold")) {
ctx.elasticsearch.index.tier = "cold";
} else if (preference.contains("data_warm")) {
ctx.elasticsearch.index.tier = "warm";
} else if (preference.contains("data_hot") || preference.contains("data_content")) {
ctx.elasticsearch.index.tier = "hot/content";
}
ctx.end.remove('elasticsearch.index.tier_preference');
ignore_failure: true
tag: script_parse_index_tier
# Failure to identify the tier preference will result in the index tier being set to unknown
# This is also the "default" case when tier preference is not available.
- set:
field: elasticsearch.index.tier
value: "unknown"
tag: set_index_tier_unknown
if: "ctx.elasticsearch.index.tier == null"
- foreach:
field: end
processor:
set:
field: "{{ _ingest._key }}"
value: "{{ _ingest._value }}"
tag: set_end_fields
- dot_expander:
field: "*"
tag: dot_expander
- date:
field: elasticsearch.index.creation_date
target_field: elasticsearch.index.creation_date
ignore_failure: true
formats:
- UNIX_MS
tag: date_parse_index_creation_date
- script:
source: |
ZonedDateTime currentDate = ZonedDateTime.parse(ctx['@timestamp']);
ZonedDateTime creationDate = ZonedDateTime.parse(ctx.elasticsearch.index.creation_date);
long ageInMillis = ChronoUnit.MILLIS.between(creationDate, currentDate);
ctx.elasticsearch.index.age = (ageInMillis / (1000 * 60 * 60 * 24)).intValue();
ignore_failure: true
tag: script_compute_index_age
- convert:
field: elasticsearch.index.primaries.docs.count
type: long
ignore_failure: true
tag: convert_primaries_docs_count
- convert:
field: elasticsearch.index.primaries.docs.count_delta
type: long
ignore_failure: true
tag: convert_primaries_docs_count_delta
- convert:
field: elasticsearch.index.primaries.store.total_data_set_size_in_bytes
type: long
ignore_failure: true
tag: convert_primaries_store_total_data_set_size_in_bytes
- convert:
field: elasticsearch.index.primaries.store.total_data_set_size_in_bytes_delta
type: long
ignore_failure: true
tag: convert_primaries_store_total_data_set_size_in_bytes_delta
- convert:
field: elasticsearch.index.total.store.size_in_bytes
type: long
ignore_failure: true
tag: convert_total_store_size_in_bytes
- convert:
field: elasticsearch.index.total.store.size_in_bytes_delta
type: long
ignore_failure: true
tag: convert_total_store_size_in_bytes_delta
- convert:
field: elasticsearch.index.total.search.query_total
type: long
ignore_failure: true
tag: convert_total_search_query_total
- convert:
field: elasticsearch.index.total.search.query_total_delta
type: long
ignore_failure: true
tag: convert_total_search_query_total_delta
- convert:
field: elasticsearch.index.total.search.query_time_in_millis
type: long
ignore_failure: true
tag: convert_total_search_query_time_in_millis
- convert:
field: elasticsearch.index.total.search.query_time_in_millis_delta
type: long
ignore_failure: true
tag: convert_total_search_query_time_in_millis_delta
- convert:
field: elasticsearch.index.total.indexing.index_total
type: long
ignore_failure: true
tag: convert_total_indexing_index_total
- convert:
field: elasticsearch.index.total.indexing.index_total_delta
type: long
ignore_failure: true
tag: convert_total_indexing_index_total_delta
- convert:
field: elasticsearch.index.total.indexing.index_time_in_millis
type: long
ignore_failure: true
tag: convert_total_indexing_index_time_in_millis
- convert:
field: elasticsearch.index.total.indexing.index_time_in_millis_delta
type: long
ignore_failure: true
tag: convert_total_indexing_index_time_in_millis_delta
- remove:
field:
- start
- end
tag: remove_start_end_fields

on_failure:
- set:
field: event.kind
value: "pipeline_error"
- append:
field: error.message
value: "Processor {{ _ingest.on_failure_processor_type }} with tag {{ _ingest.on_failure_processor_tag }} failed with message {{ _ingest.on_failure_message }}"
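The `script_compute_index_age` processor in the pipeline above computes the index age in whole days from the parsed creation date. A minimal Python sketch of the same arithmetic (assuming ISO-8601 inputs, as produced by the preceding `date` processor):

```python
from datetime import datetime

def index_age_days(event_timestamp, creation_date):
    """Whole days elapsed between index creation and the event timestamp,
    mirroring ChronoUnit.MILLIS.between(...) / (1000 * 60 * 60 * 24)."""
    current = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    created = datetime.fromisoformat(creation_date.replace("Z", "+00:00"))
    age_ms = int((current - created).total_seconds() * 1000)
    return age_ms // (1000 * 60 * 60 * 24)

# An index created on Dec 1 at noon is 36 whole days old on Jan 7 at midnight.
print(index_age_days("2025-01-07T00:00:00Z", "2024-12-01T12:00:00Z"))
```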
