Problems scaling Prometheus for proxy metrics #6087
Replies: 3 comments
-
We are experiencing the same issue. Have you had any luck fixing it, @smusick-teamwork?
-
We bundle Prometheus in the viz extension because it's popular and easy to get started with, but we're not really experts in scaling it. There are a lot of guides out there devoted to scaling Prometheus, so I would probably just pick one and start there, or ask in one of the Prometheus forums.
-
It sounds like you already have a distributed Prometheus topology but are still hitting scalability limits in the scraper deployments. Consider tuning the Prometheus configuration for efficiency, lengthening scrape intervals, or exploring horizontal scaling options such as running multiple instances behind a load balancer or using Thanos for long-term storage and querying.
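To make the tuning suggestion concrete, here is a minimal sketch of a scrape config that lowers the scrape frequency and drops one high-cardinality series. The 30s interval and the `response_latency_ms_bucket` drop rule are assumptions for illustration, not recommendations; check which series actually dominate your memory usage before dropping anything.

```yaml
# Sketch only: scrape less often and drop per-bucket latency histograms.
# Note that dropping *_bucket series removes latency percentiles from
# any dashboards that compute them.
scrape_configs:
  - job_name: 'linkerd-proxy'
    scrape_interval: 30s   # longer interval = fewer samples, less memory
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'response_latency_ms_bucket'   # example high-cardinality series
        action: drop
```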
-
So I'm trying to add Linkerd (version 2.9.4) to a couple of clusters we have running, each with differently sized workloads and traffic.
Each of our clusters is divided into several namespaces, based on the purpose of the workloads.
Everything was going fine until I ran into problems getting Prometheus and the web dashboard to scale in our second-largest cluster. The single Prometheus instance would constantly run out of memory and fail, even after giving it a memory request and limit of 25Gi.
The solution I found was to run a Prometheus deployment in each of our namespaces. Those deployments scraped all of the proxies within their namespace and sent the metrics to another instance via remote write. The original Prometheus instance in the linkerd namespace was then responsible for scraping the remaining, non-proxy metrics.
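A per-namespace scraper along these lines would be a minimal sketch of that layout, not the exact config: the namespace name and remote-write URL are placeholders, the keep-rule mirrors Linkerd's stock linkerd-proxy scrape job, and the receiving instance must be able to accept remote writes (for plain Prometheus that means enabling the remote-write receiver; Thanos Receive is another option).

```yaml
# Sketch: one Prometheus per namespace, scraping only that namespace's
# linkerd-proxy sidecars and forwarding samples to a central instance.
scrape_configs:
  - job_name: 'linkerd-proxy'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['my-app-namespace']   # placeholder: the namespace this scraper owns
    relabel_configs:
      # Keep only meshed pods' proxy containers (mirrors Linkerd's default job).
      - source_labels:
          - __meta_kubernetes_pod_container_name
          - __meta_kubernetes_pod_container_port_name
          - __meta_kubernetes_pod_label_linkerd_io_control_plane_ns
        action: keep
        regex: ^linkerd-proxy;linkerd-admin;linkerd$
      # Preserve namespace/pod labels so the central instance can tell
      # the shards apart.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
remote_write:
  # Placeholder URL; the receiver must have remote-write ingestion enabled.
  - url: http://central-prometheus.linkerd.svc.cluster.local:9090/api/v1/write
```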
It wasn't a perfect solution. The web dashboard still fails often, but all of the Grafana dashboards work like a charm.
I've since tried the same approach in our larger, busier cluster. I've added 19 pods to the mesh in the first namespace, and I'm running into the same issues that I saw in the other cluster: the scraper deployment just can't handle it.
We can't be the first people to run into these kinds of issues; I must be missing something.
Has anyone run into similar problems, and what solutions did you find?