Problems scaling Prometheus for proxy metrics #6087
Replies: 3 comments
-
We are experiencing the same issue. Have you had any luck fixing it, @smusick-teamwork?
-
We bundle Prometheus in the viz extension because it's popular and easy to get started with, but we're not really experts in scaling it. There are a lot of guides out there devoted to scaling Prometheus, so I would probably just pick one and start there, or ask in one of the Prometheus forums.
-
It sounds like you already have a distributed Prometheus topology but are still hitting scalability limits in the scraper deployments. Consider tuning the Prometheus configuration for efficiency, lengthening scrape intervals, or exploring horizontal scaling options such as running multiple instances behind a load balancer or using Thanos for long-term storage and querying.
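To make the tuning suggestion concrete, here is a minimal sketch of a scrape config that lowers the scrape frequency and drops one high-cardinality series. The 30s interval and the `response_latency_ms_bucket` drop rule are assumptions for illustration, not recommendations; check which series actually dominate your memory usage before dropping anything.

```yaml
# Sketch only: scrape less often and drop per-bucket latency histograms.
# Note that dropping *_bucket series removes latency percentiles from
# any dashboards that compute them.
scrape_configs:
  - job_name: 'linkerd-proxy'
    scrape_interval: 30s   # longer interval = fewer samples, less memory
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'response_latency_ms_bucket'   # example high-cardinality series
        action: drop
```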
-
So I'm trying to add Linkerd (version 2.9.4) to a couple of clusters we have running, each with differently sized workloads and traffic.
Each of our clusters is divided into several namespaces, based on the purpose of the workloads.
Everything was going fine until I ran into problems getting Prometheus and the web dashboard to scale in our second-largest cluster. The single Prometheus instance would constantly run out of memory and fail, even after giving it a memory request and limit of 25Gi.
The solution I found was to run a Prometheus deployment in each of our namespaces. Those deployments scraped all of the proxies within their namespace and sent the metrics to another instance via remote write. The original Prometheus instance in the linkerd namespace was then responsible for scraping the remaining, non-proxy metrics.
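A per-namespace scraper along these lines would be a minimal sketch of that layout, not the exact config: the namespace name and remote-write URL are placeholders, the keep-rule mirrors Linkerd's stock linkerd-proxy scrape job, and the receiving instance must be able to accept remote writes (for plain Prometheus that means enabling the remote-write receiver; Thanos Receive is another option).

```yaml
# Sketch: one Prometheus per namespace, scraping only that namespace's
# linkerd-proxy sidecars and forwarding samples to a central instance.
scrape_configs:
  - job_name: 'linkerd-proxy'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['my-app-namespace']   # placeholder: the namespace this scraper owns
    relabel_configs:
      # Keep only meshed pods' proxy containers (mirrors Linkerd's default job).
      - source_labels:
          - __meta_kubernetes_pod_container_name
          - __meta_kubernetes_pod_container_port_name
          - __meta_kubernetes_pod_label_linkerd_io_control_plane_ns
        action: keep
        regex: ^linkerd-proxy;linkerd-admin;linkerd$
      # Preserve namespace/pod labels so the central instance can tell
      # the shards apart.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
remote_write:
  # Placeholder URL; the receiver must have remote-write ingestion enabled.
  - url: http://central-prometheus.linkerd.svc.cluster.local:9090/api/v1/write
```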
It wasn't a perfect solution. The web dashboard still fails often, but all of the Grafana dashboards work like a charm.
I've since tried the same approach in our larger, busier cluster. I've added 19 pods to the mesh in the first namespace, and I'm running into the same issues that I saw in the other cluster: the scraper deployment just can't handle it.
We can't be the first people to run into these kinds of issues; I must be missing something.
Has anyone run into similar problems, and what solutions did you find?