[Feature]: Support elasticsearch data stream #4708

pengweiqhca · 2023-08-29T12:13:47Z

Requirement

Problem

I have created a data stream index template and tried --es.use-aliases=true, but it does not work.

Proposal

No response

Open questions

No response


jack78901 · 2024-02-23T20:29:16Z

Is this because the client library used for connecting to Elasticsearch is outdated? In the logs of version 1.54 of the collector I see that the client is olivere/[email protected], which dates from July 27th, 2021. The client also no longer appears to be maintained: it was last updated to version 7.0.32 on Mar 19th, 2022, almost two years ago now. Elastic has its own Go library now: https://github.com/elastic/go-elasticsearch.

tronda · 2024-03-01T10:40:35Z

I see that the OpenSearch project has the same feature request.

JaredTan95 · 2024-08-06T07:12:15Z

I am currently working on this feature, but I would prefer to implement it in Jaeger v2.

yurishkuro · 2024-08-06T16:29:43Z

@JaredTan95 +2 for v2, but what specifically is the difference? Our storage implementations are currently identical in v1 and v2 (with the exception of configuration)

JaredTan95 · 2024-08-07T08:35:20Z

> @JaredTan95 +2 for v2, but what specifically is the difference? Our storage implementations are currently identical in v1 and v2 (with the exception of configuration)

Currently, to use an ES ILM policy we need to create aliases and the initial xxx-00001 index before jaeger-collector writes any data. This makes operation and maintenance harder. ES data streams shield users from these problems: no bootstrap work is needed, and data can be written to ES directly through the data stream. Index rollover and alias switching are managed within the data stream itself.
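To make the contrast concrete, here is a minimal sketch of the two write paths described above. The alias names (`<prefix>-write`, `<prefix>-read`) and the six-digit `-000001` suffix are assumptions for illustration, not necessarily what Jaeger or es-rollover use; the helper functions are hypothetical and only build request descriptions, they do not talk to a cluster.

```python
def legacy_bootstrap_request(prefix):
    """Alias-based rollover: the first index and its aliases must exist
    before any writer starts. Alias names here are assumed."""
    body = {
        "aliases": {
            # Rollover moves the write alias to each newly created index.
            f"{prefix}-write": {"is_write_index": True},
            f"{prefix}-read": {},
        }
    }
    return ("PUT", f"/{prefix}-000001", body)


def data_stream_write_target(prefix):
    """Data stream: writers address the stream by name, with no
    per-deployment bootstrap index to create first."""
    return f"/{prefix}/_bulk"
```

With aliases, someone has to issue the bootstrap request exactly once per deployment; with a data stream, the writer only needs the stream name.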

yurishkuro · 2024-08-08T01:13:30Z

That doesn't really address my question: all ES-specific details are hidden behind es.Factory, which is used by both Jaeger v1 and v2, so I don't see a difference.
Data streams also need to be set up once, just like index mappings. We want to do that on collector start-up, with some coordination between multiple collector instances (if the creation operations are idempotent then it's less of an issue).
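One way to read the idempotency remark: if every collector attempts the creation on start-up and treats "already exists" as success, no explicit coordination is needed. The sketch below illustrates that pattern with an in-memory stand-in for the cluster; `FakeCluster`, `AlreadyExists` and `ensure_data_stream` are hypothetical names, and a real implementation would map the Elasticsearch `resource_already_exists_exception` error response to the same branch.

```python
class AlreadyExists(Exception):
    """Stand-in for an 'already exists' error returned by the cluster."""


class FakeCluster:
    """In-memory stand-in for an Elasticsearch cluster (illustration only)."""

    def __init__(self):
        self.data_streams = set()

    def create_data_stream(self, name):
        if name in self.data_streams:
            raise AlreadyExists(name)
        self.data_streams.add(name)


def ensure_data_stream(cluster, name):
    """Create the data stream, treating 'already exists' as success.

    Returns True if this caller created it, False if it was already there.
    """
    try:
        cluster.create_data_stream(name)
        return True
    except AlreadyExists:
        return False
```

Several collector instances can call `ensure_data_stream` in any order; exactly one of them performs the creation and the others fall through without error.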

Evesy · 2025-01-10T10:36:55Z

@JaredTan95 Is this still something you're looking at/working on?

We're also keen to look at using data streams for Jaeger. In our case we manage all the templates, lifecycle etc. outside of Jaeger, so our main requirements are to have Jaeger not add the date suffix to the index it writes to, and also to use only create operations when indexing (since index op type is not allowed for datastreams)
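As a rough illustration of the two requirements above, the sketch below builds a bulk request body that targets a fixed name with no date suffix and uses only the `create` action. The function name and the sample document are hypothetical; the newline-delimited layout and the trailing newline follow the Elasticsearch bulk API format.

```python
import json


def bulk_create_body(data_stream, docs):
    """Build an NDJSON bulk body that uses only `create` actions.

    The target is the data stream name itself, with no date suffix.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": data_stream}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Calling `bulk_create_body("jaeger-span", [span_doc])` yields one action line and one source line per document.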

> Data streams also need to be set up once, just like index mappings

You can get by without explicitly creating the datastream up front and just having a matching index template that is configured to use a data stream

zzzk1 · 2025-01-13T05:20:27Z

@yurishkuro @Evesy Follow this guide. We have two steps to complete: 1. Create an index template 2. Create a data stream.
We can utilize existing resources to implement our ideas.

To create a template using es\client.CreateTemplate(), we need the following components: mapping, setting, and index.
A new data stream client call would then send an HTTP request, PUT _data_stream/my-data-stream, to create the data stream.
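The two steps can be sketched as a pair of raw requests. This only constructs the method, path and body for each call; the template name suffix (`-template`) and the helper name are assumptions for illustration, while `PUT _index_template/<name>` with a `data_stream` object and `PUT _data_stream/<name>` are the documented Elasticsearch endpoints.

```python
def data_stream_setup_requests(name, mappings=None, settings=None):
    """Return (method, path, body) tuples for the two setup steps."""
    template = {
        "index_patterns": [f"{name}*"],
        "data_stream": {},  # marks the template as a data stream template
        "template": {
            "mappings": mappings or {},
            "settings": settings or {},
        },
    }
    return [
        ("PUT", f"/_index_template/{name}-template", template),
        ("PUT", f"/_data_stream/{name}", None),
    ]
```

As noted earlier in the thread, the second request can be omitted if the first write to a matching name is allowed to create the stream.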

Do you have any advice? 🤔

yurishkuro · 2025-01-13T05:22:44Z

What sort of advice are you asking for?

data-dude · 2025-01-14T15:53:02Z

It's my understanding this is a pretty important feature. Datastreams allow me to grow index size without doing any work. If Jaeger doesn't support them then there's always a risk that the size I manually set for my indexes is too small.

zzzk1 · 2025-01-15T05:20:09Z

> What sort of advice are you asking for?

Here is what I did:

1. Ran Jaeger locally from /jaeger/cmd/all-in-one/main.go with Elasticsearch as the backend storage.
2. Sent a GET request to {{ES_LOCAL_URL}}/jaeger-span-2025-01-15/_search?pretty and referenced span-response.json.

Please take note of the following:
There are three index templates: 1. jaeger-dependencies 2. jaeger-span 3. jaeger-service. Only jaeger-span-2025-01-15 contains a timestamp field. Should we create a data stream for jaeger-span only?

yurishkuro · 2025-01-15T14:41:51Z

Yes we can start with that. Services index is just a cache to avoid scanning the whole database to find distinct service names - perhaps ES can maintain that automatically as some sort of materialized preaggregation.

Manik2708 · 2025-01-16T16:06:41Z

@yurishkuro @zzzk1 I have been studying data streams for a while, and I think these points need to be covered:

1. I don't think we need to extend es-rollover to support data streams. Wouldn't it be better to integrate ILM/ISM and data streams directly into the Jaeger binary and deprecate manual rollover? This also relates to: Remove the need for external tools for managing Elasticsearch #6283.
2. Using ILM with data streams would be ideal, but auto-creating the policy is challenging, as the policy for the stream's backing indexes and automatic rollover will look like this (as per the official docs):

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "60d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      },
      "frozen": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      },
      "delete": {
        "min_age": "735d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Making the policy configurable for the user is going to be a challenge. We could create our own policy struct (because one is not provided by the ES client), but that would add maintenance overhead, since keeping it synchronized with ES updates will be difficult.
3. What part of the policy needs to be configurable? I am not in favor of making the whole policy configurable; for example, when using data streams we should not skip rollover.
4. Is this change backward compatible? Probably yes, because reading from a data stream is the same as reading from an index. The real question is whether it will work as expected with old data. Users upgrading to a newer version will have two kinds of data, one in the old indices and another in data streams. I am still researching this, and it is currently unclear whether that is an acceptable situation.
5. Will data streams be optional, or is Jaeger going to use only data streams in the future? The answer determines whether we should update the existing index templates to make them compatible with data streams, or create new templates that take effect only when the user enables data streams.
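Point 3 suggests that whatever policy is supplied should at least keep rollover in the hot phase. A minimal sketch of such a check is shown below; it inspects only the JSON structure from the example policy above, and the function name is hypothetical rather than anything in Jaeger.

```python
import json


def has_hot_rollover(policy_json):
    """Return True if the policy defines a rollover action in its hot phase."""
    policy = json.loads(policy_json)
    hot = policy.get("policy", {}).get("phases", {}).get("hot", {})
    return "rollover" in hot.get("actions", {})
```

A check like this leaves the rest of the policy opaque, so it would not need to track every new phase or action that Elasticsearch adds.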

Manik2708 · 2025-01-16T16:26:05Z

> @JaredTan95 Is this still something you're looking at/working on?
>
> We're also keen to look at using data streams for Jaeger. In our case we manage all the templates, lifecycle etc. outside of Jaeger, so our main requirements are to have Jaeger not add the date suffix to the index it writes to, and also to use only create operations when indexing (since index op type is not allowed for datastreams)

I don't think it will break your requirements. Adding a data stream will not change Jaeger's index naming convention! As far as I can understand from the docs, it will only require adding @timestamp to the mapping: https://www.elastic.co/guide/en/elasticsearch/reference/current/set-up-a-data-stream.html#create-component-templates
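Since data streams require an `@timestamp` field on every document, the writer would also have to populate it. The sketch below derives it from a span's start time; it assumes the span document carries a `startTimeMillis` field in epoch milliseconds, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def with_timestamp(span_doc):
    """Return a copy of the span document with an ISO 8601 `@timestamp`.

    Assumes `startTimeMillis` holds epoch milliseconds (an assumption here).
    """
    millis = span_doc["startTimeMillis"]
    # Integer arithmetic avoids floating-point rounding of the milliseconds.
    ts = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    ts += timedelta(milliseconds=millis % 1000)
    out = dict(span_doc)
    out["@timestamp"] = ts.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return out
```

The original document is left unmodified, so existing fields and readers that rely on them are unaffected.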

data-dude · 2025-01-16T16:32:00Z

Using ILM with datastreams makes sense to me. People are definitely going to want to configure age values. Datastreams make the system more robust and easier to maintain. I'm not sure why someone wouldn't want to use them for something like traces. But some people may still want to use the old way, I dunno.

yurishkuro · 2025-01-18T01:09:50Z

@Manik2708 what if we didn't support any automation or customization? We can include a sample JSON for creating a policy, if people need to customize it they can. The only automation that I think will be useful is if the policy was automatically created when Jaeger starts the first time against empty storage, by using that same JSON file. We can provide an option for a user to override that file as a whole, if they want (they can also edit the policy via ES JSON API, I assume).
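A minimal sketch of that idea: ship one default policy document, let the user replace it as a whole with their own file, and pass the JSON through untouched to the ILM policy endpoint. The default policy shown, the policy name and both helper names are hypothetical; only `PUT _ilm/policy/<name>` is the documented Elasticsearch API.

```python
import json
from pathlib import Path

# Hypothetical built-in default; a real one would be the sample JSON file.
DEFAULT_POLICY = (
    '{"policy": {"phases": {"hot": {"actions": '
    '{"rollover": {"max_primary_shard_size": "50gb"}}}}}}'
)


def resolve_policy(override_path=None):
    """Return the raw policy JSON: the override file as a whole, else the default."""
    raw = Path(override_path).read_text() if override_path else DEFAULT_POLICY
    json.loads(raw)  # fail fast on malformed JSON; the contents are not interpreted
    return raw


def policy_request(name, raw_policy):
    """Describe the raw REST call that would create or replace the policy."""
    return ("PUT", f"/_ilm/policy/{name}", raw_policy)
```

Because the policy is treated as an opaque JSON document, no policy struct has to be kept in sync with Elasticsearch.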

Manik2708 · 2025-01-18T05:19:41Z

> @Manik2708 what if we didn't support any automation or customization? We can include a sample JSON for creating a policy, if people need to customize it they can. The only automation that I think will be useful is if the policy was automatically created when Jaeger starts the first time against empty storage, by using that same JSON file. We can provide an option for a user to override that file as a whole, if they want (they can also edit the policy via ES JSON API, I assume).

This sounds good! But I have a question: where are we gonna keep this JSON file so that the user can edit/override it? I assume we would keep it in the factory, just like the mappings, but then it would be difficult for the user to find that file!

> (they can also edit the policy via ES JSON API, I assume).

Yes, there is an API.

yurishkuro · 2025-01-19T19:24:31Z

> Where are we gonna keep this json file so that user can edit/override that

It will be in this repository and embedded in the binary. The documentation on the website can link to it, while explaining how to apply that policy against the db.

zzzk1 · 2025-01-22T13:15:59Z

@yurishkuro @Manik2708 I wrote a doc about how I am trying to implement this feature; please see: Jaeger ES data stream proposal. Thanks.

Manik2708 · 2025-01-22T13:47:41Z

> @yurishkuro @Manik2708 I write a doc about how i am try to implement this feature, please see: Jaeger ES data stream proposal thanks.

Things look good to me, but I have some points about it:

1. We use two clients for ES: Olivere and the v8 client. There is no policy type in Olivere, but there is one in the official client. The Olivere client is used everywhere except when creating templates for ES v8. Even if we find a way, we still can't embed the whole policy struct inside the config. Please see [Feature]: Support elasticsearch data stream #4708 (comment). This seems to be a good approach, and the user can later update the policy using the ES APIs.
2. In OS there is an ISM policy instead of an ILM policy, so we have to take care of that as well (although they are very similar).
3. I have been implementing the ILM policy in Jaeger and it is currently in the E2E testing phase (it's a very long PR, 1500 lines, so I am testing it strictly). The question that is always on my mind is: Jaeger allows manual installation of templates, so what if the user has enabled ILM but installs templates manually? Those user-installed templates could become a problem for rollover, as they might not include the policy name or the rollover alias. The only solutions I can think of are to disallow manual installation of templates when ILM is enabled, or to update the template according to the use of ILM. I think @yurishkuro can help us with this.
4. Will the policy you have included in the doc be enough for the data stream? There are many phases, and we have to discuss them thoroughly to decide what makes an effective data stream.
5. I have researched ILM but not data streams so far; either way, we need to ensure that whatever change we introduce is backward compatible.

yurishkuro · 2025-01-23T01:15:29Z

@zzzk1 please allow comments on the doc.

@Manik2708

1. As discussed previously, I do not want to invent yet another configuration language to describe ILM policy. We should only support the official JSON representation of the policy. Incidentally, it means we should use raw REST API for creating it, not a strongly typed ES client.
2. Yes.
3. My proposal is that we make a strongly opinionated choice and only support one way of running ES, with data streams and ILM. Deprecate all other modes.
4. Another reason why I don't want to deal with that at all: if they want a complex policy, they can create a complex JSON and either have Jaeger execute it on start-up or even execute it themselves.
5. I don't know if data streams are compatible with any other ways. We recently landed a breaking change, [v2][query] Create archive reader/writer using regular factory methods #6519, which has a migration path by manually creating (once) index aliases. We could investigate if similar steps are available for backwards compatibility, but if not we do it the new way. Most of the storage implementation doesn't actually change, so if people already have a legacy setup then it might still work.

pengweiqhca added the enhancement label Aug 29, 2023

yurishkuro added the help wanted Features that maintainers are willing to accept but do not have cycles to implement label Sep 25, 2024

yurishkuro mentioned this issue Sep 25, 2024

Support using ISM (instead of ILM) for index state management #3279

Closed

yurishkuro mentioned this issue Jan 14, 2025

Remove the need for external tools for managing Elasticsearch #6283

Closed

5 tasks

zzzk1 linked a pull request Jan 16, 2025 that will close this issue

[WIP][feat]: Support elasticsearch data stream #6551

Draft

7 tasks
