Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a prometheus label mapping component #2025

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

vaxvms
Copy link

@vaxvms vaxvms commented Nov 4, 2024

PR Description

This PR add a prometheus component to create a label based on a source_label value and a static mapping table.

Which issue(s) this PR fixes

For a large mapping table, using regex with prometheus.relabel is inefficient

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated

@vaxvms vaxvms requested review from clayton-cornell and a team as code owners November 4, 2024 14:51
@CLAassistant
Copy link

CLAassistant commented Nov 4, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@clayton-cornell clayton-cornell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial doc review comments

@clayton-cornell clayton-cornell requested a review from a team November 5, 2024 18:27
@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Nov 5, 2024
@thampiotr
Copy link
Contributor

Thanks for contributing.

For a large mapping table, using regex with prometheus.relabel is inefficient

Could you share the use case and performance differences? In many cases users can use discovery.relabel to add the team=X labels - and then the overhead is much smaller, because it's not samples, but targets that are relabeled.

If the reason to add this is that prometheus.relabel it's not efficient, we will need some more data on how inefficient it is and whether it could be optimised to work faster. Any new solution we would consider will require benchmarks to prove that it's more efficient. We also generally want to avoid having many ways of doing the same thing - but there can be exceptions.

@vaxvms
Copy link
Author

vaxvms commented Nov 6, 2024

I have to map a label holding the service ID to a label holding the tenant ID.

My test case is using X services ID mapped to Y tenants. My test case is mapping 100 different services ID to a single tenant.
Obviously, Y<=X.

Using prometheus.relabel, with X set to 20k (and Y to 200), i'm consuming 20 vCPU. I'm also having a 49% failure rate

image

Using prometheus.mapping, with X set to 100k (and Y to 1k), i'm consuming 2 vCPU

image

As my worst case can be X == Y, i haven't try to summarize regex so i have a rule block for each service ID

@vaxvms vaxvms force-pushed the main branch 2 times, most recently from b72677a to 99c869c Compare November 15, 2024 09:44
@thampiotr
Copy link
Contributor

Thanks for sharing some performance data. Have you tried to use discovery.relabel instead?

The issue with prometheus.relabel is that you are relabelling every single sample. With discovery.relabel you can relabel the targets before they are scraped.

So, for example, in a cluster of 1k targets and each exposing 1k metrics, you would have 1 million relabels with prometheus.relabel as opposed to 1k relabels with discovery.relabel.

If we find that for some reason discovery.relabel cannot be used, I would like to understand better why. If we need to optimise prometheus.relabel I would like to explore that option too before committing to create a new feature to do a very similar job. We may also need to go through proposal process to let team/people comment on this.

@vaxvms
Copy link
Author

vaxvms commented Nov 20, 2024

Thanks for your answer.

I'm not using alloy to scrape targets but I'm receiving data thru remote_write from multiple data producers.
The data producers only have the information about the service_id but not the tenant_id, therefore they cannot apply the correct label at the scraping time.

I'm unable to see how we could optimize prometheus.relabel to be as performant as a key lookup. We have made some attempt to share cache between instances #1692 which aims to lower the memory consumption.

Here is a extract of my test configuration, using prometheus.relabel:

prometheus.receive_http "ingestion_head" {
    http {
        listen_address = "0.0.0.0"
        listen_port    = 9000
    }
    forward_to = [prometheus.relabel.allow_list.receiver]
}

prometheus.relabel "allow_list" {
    forward_to = [prometheus.remote_write.ingestion_tail.receiver]
    rule {
        action        = "keep"
        regex         = "(.+)"
        source_labels = ["service_id"]
    }


    // One rule block per service_id
    rule {
        source_labels = ["service_id"]
        regex         = "^000001$"
        target_label  = "tenant_id"
        replacement   = "tenant_42"
    }
    
    rule {
        source_labels = ["service_id"]
        regex         = "^000002$"
        target_label  = "tenant_id"
        replacement   = "tenant_190"
    }
    
    // ...
    
    rule {
        source_labels = ["service_id"]
        regex         = "^12345678$"
        target_label  = "tenant_id"
        replacement   = "tenant_8876"
    }
}

prometheus.remote_write "ingestion_tail" {
    endpoint {
        url = "http://mimir/"
    }
}

The prometheus.relabel can be fairly heavy on the resources especially when we do a 1:1 match, hence the proposal to have a mapping. Here is an example of a working proof of concept based on the code in the PR.
Here is a schema:

image

prometheus.receive_http "ingestion_head" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 9000
  }
  forward_to = [prometheus.relabel.allow_list.receiver]
}

prometheus.relabel "allow_list" {
  forward_to = [prometheus.mapping.tenants_mapping.receiver]

  // First ditch everything that don't have service_id
  rule {
    action        = "keep"
    regex         = "(.+)"
    source_labels = ["service_id"]
  }
}

prometheus.mapping "tenants_mapping" {
  forward_to = [prometheus.remote_write.ingestion_tail.receiver]

  src_label_name = "service_id"
  output_label_name = "tenant_id"

  mapping = {
   "000001" = "tenant_42",
   "000002" = "tenant_190",
   // ....
   "12345678" = "tenant8876",
  }
}
prometheus.remote_write "ingestion_tail" {
        endpoint {
                url = "http://mimir/"
        }
}

@wilfriedroset
Copy link

What we forgot to explicit with @vaxvms is that our initial design would benefit from #521 we would remove the need for cortex-tenant

@clayton-cornell
Copy link
Contributor

@vaxvms Did the doc topic at components/prometheus/prometheus.relabel accidentally fall off this PR in one of the force/push updates? Or was it removed intentionally? I don't see it in the file list anymore.

@vaxvms
Copy link
Author

vaxvms commented Nov 26, 2024

@vaxvms Did the doc topic at components/prometheus/prometheus.relabel accidentally fall off this PR in one of the force/push updates? Or was it removed intentionally? I don't see it in the file list anymore.

Are you talking about components/prometheus/prometheus.mapping ? I accidentally removed it. It back again.

I've updated the way the component work: The target_label isn't fixed anymore. each value of the source label can be mapped to severals labels

@vaxvms vaxvms force-pushed the main branch 2 times, most recently from d77e275 to c5db3a5 Compare November 26, 2024 14:16
@vaxvms
Copy link
Author

vaxvms commented Nov 28, 2024

Thanks for review @clayton-cornell
Most of the phrasing came from prometheus.relabel.md

@clayton-cornell
Copy link
Contributor

Most of the phrasing came from prometheus.relabel.md

Yeah that topic needs some attention too. I've got an open task to review and refactor the component docs.

@clayton-cornell clayton-cornell requested a review from a team November 28, 2024 18:22
SourceLabel string `alloy:"source_label,attr"`

// Mapping
LabelValuesMapping map[string]map[string]string `alloy:"mapping,attr"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a static map the best user experience for this? Should it support reading from a file - especially using the exports of a local.file or remote.s3 for example? This allows for polled updating of the static map without needing to reload/restart/directly edit the Alloy config.

Copy link
Author

@vaxvms vaxvms Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can dynamically read the mapping using alloy components. Using this method, it avoid having parsing logic in mapping component :

...
local.file "mapping" {
   filename = file.path_join(module_path, "mapping.json")
}

prometheus.mapping "tenants_mapping" {
  forward_to   = [prometheus.remote_write.cluster.receiver]
  source_label = "uid_label_key1"
  mapping      = encoding.from_json(local.file.mapping.content)
}
...

Did i miss something ?

return nil
}

func (c *Component) mapping(val float64, lbls labels.Labels) labels.Labels {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not looking to see if the val is a staleNaN for caching, we probably don't need it to still be in the function call.

Is there a reason this isn't using the global labelstore like prometheus.relabel? Did you test performance with/without it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i have used prometheus.relabel as a starting point, i forgot to remove val parameter. Fixed.

The performance improvement was so high, i didn't bothered looking for increased performance. Relabeling can be a heaving process involving regex whereas mapping is just a hash lookup.

Using the labelstore, it will have to

  • compute a hash based on labels
  • fetch the globalId in a map
  • try to fetch new labels set from cache

I might be missing something but i can't see why 2 map lookup would be faster than a single lookup

// Relabel against a copy of the labels to prevent modifying the original
// slice.
lb := labels.NewBuilder(lbls.Copy())
sourceValue := lb.Get(c.sourceLabel)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic will attempt to map when the sourceLabel is not present. Rather than the mapping behave the same when sourceLabel is not present && sourceLabel has explicit value "", should there be an (optional) explicit defaultMapping for when sourceValue is not present?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any strong opinion about this one. Having all mapping values at the same place make more sense to mebut i don't really bother.

@vaxvms
Copy link
Author

vaxvms commented Dec 17, 2024

@dehaansa @clayton-cornell @thampiotr Is here anything I have to fix/document/explain/change ?

@wildum
Copy link
Contributor

wildum commented Dec 17, 2024

hey @vaxvms fyi most of the team is on holiday break. The review might be paused for the next few weeks

@vaxvms
Copy link
Author

vaxvms commented Dec 17, 2024

So I have to wish them to have a nice holiday !

@clayton-cornell clayton-cornell dismissed their stale review January 2, 2025 19:01

Clicked approve instead of comment

@clayton-cornell
Copy link
Contributor

@dehaansa will this be released as a GA component? or as a Preview/Experimental? The docs don't mention the --stability.levelruntime flag... I'm wondering if it needs to be added in here.

@dehaansa
Copy link
Contributor

dehaansa commented Jan 3, 2025

@dehaansa will this be released as a GA component? or as a Preview/Experimental? The docs don't mention the --stability.levelruntime flag... I'm wondering if it needs to be added in here.

Probably a good idea, should start as Public Preview.

@clayton-cornell
Copy link
Contributor

Probably a good idea, should start as Public Preview.

The doc suggestion is in place. I assume there's something code-side as well.. I'll leave that to you :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Docs Squad label across all Grafana Labs repos
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants