
Add additional PromQL operators to synthetic load #747

Merged
merged 5 commits into prometheus:master on Oct 20, 2024

Conversation

kushalShukla-web (Contributor)

This PR enhances the synthetic load generation by incorporating additional PromQL operators in the 6_loadgen.yaml file (a sketch of the kind of entry involved follows the list below):

  • Added binary arithmetic operators (joins) to cover more complex query types.
  • Included logical operators (and, or, unless) for better query testing coverage.
  • Added topk function to test query performance with ranked results.
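A hedged sketch (not the PR's exact content) of the kind of query group these operators end up in; only keys visible in the review hunks below are used, and the interval and expressions are illustrative:

    interval: 10s
    type: instant
    queries:
    # binary arithmetic with a scalar
    - expr: rate(node_cpu_seconds_total[5m]) * 5
    # logical operator (union of two selections)
    - expr: node_cpu_seconds_total{mode="nice"} or node_cpu_seconds_total{mode="idle"}
    # topk over an aggregation
    - expr: topk(10, sum by (instance) (rate(node_cpu_seconds_total[5m])))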

@kushalShukla-web (Contributor, Author)

Solves #705

@bboreham (Member) left a comment

Hi, I commented on a few, but they are all likely problematic.
I recommend you change tack to use metrics exposed by the fake webserver, since there are a lot of them and prombench can control them.

interval: 10s
type: instant
queries:
- expr: sum(node_cpu_seconds_total)/sum(container_memory_rss)
bboreham (Member) commented:
This is not terribly good as a test of operator performance, since it matches one series against one other series, both of which have no labels.
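An illustrative alternative (not from the PR): keeping per-series labels on one side forces real vector matching, for example dividing each per-mode CPU rate by its instance total with many-to-one matching; this assumes the standard node_exporter cpu/mode/instance labels:

    # share of each CPU mode within its instance's total
    rate(node_cpu_seconds_total[5m])
      / on (instance) group_left
    sum by (instance) (rate(node_cpu_seconds_total[5m]))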

type: instant
queries:
- expr: sum(node_cpu_seconds_total)/sum(container_memory_rss)
- expr: rate(node_cpu_seconds_total[5m]) * 5
bboreham (Member) commented:
This is better, at 256 series when tested.

queries:
- expr: sum(node_cpu_seconds_total)/sum(container_memory_rss)
- expr: rate(node_cpu_seconds_total[5m]) * 5
- expr: sum(go_gc_heap_goal_bytes)/sum(loadgen_query_duration_seconds_created)
bboreham (Member) commented:
This has the same structural problem as the first one.

interval: 10s
type: instant
queries:
- expr: node_cpu_seconds_total{mode="nice"} and node_cpu_seconds_total{namespace="default"}
bboreham (Member) commented:
Nodes don't have a namespace label, so this short-circuits and returns nothing.
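A hedged variant (illustrative, assuming node_exporter's standard cpu label) where both sides select series that actually exist, so the and intersection is non-empty; with set operators the full label sets must match, so only series with both mode="nice" and cpu="0" survive:

    node_cpu_seconds_total{mode="nice"} and node_cpu_seconds_total{cpu="0"}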

@kushalShukla-web (Contributor, Author)

> Hi, I commented on a few, but they are all likely problematic. I recommend you change tack to use metrics exposed by the fake webserver, since there are a lot of them and prombench can control them.

Actually, I have tested this on one of the pull requests where prombench was running.

@bboreham (Member)

OK, but my recommendation remains the same.

@kushalShukla-web force-pushed the queries branch 4 times, most recently from fcab2db to a16c6db on September 21, 2024 at 04:26
Replaced Old metrics with the new ones
@bboreham (Member)

> I have replaced some of the existing metrics with new ones from PromBench, such as:

How many of those did you count?

@kushalShukla-web (Contributor, Author)

Hi @bboreham, all the queries return more than 2000 series, and the codelab metric has 52,000 different variations.

updated metrics with some heavy count
@bboreham (Member) left a comment

Thanks, this is getting better.

I am still interested in thinking about the cardinality you are expecting for each operator in each query.

At the end of the day there should be a balance across different kinds of load, so we can justify that prombench is a realistic test, and also we want the prometheus under load to be able to keep up.

type: instant
queries:
- expr: topk(2000, sum(rate(go_gc_duration_seconds_count[5m])) by (instance, job))
- expr: topk(10000, sum(codelab_api_request_duration_seconds_bucket) by (method,job))
bboreham (Member) commented:
topk(10000, …) is not realistic; nobody is going to scroll down 10,000 lines of screen output to find something.
k should be more like 10, or perhaps 100.
Also I don't think there are 10,000 combinations of method and job.
Also it is not valid to sum histogram buckets.
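Hedged sketches of the kind of rework this comment points toward (illustrative only, not the exact queries that landed): a small k over an aggregated rate, and histogram_quantile over the buckets instead of summing them:

    # top 10 method/job pairs by request rate
    topk(10, sum by (method, job) (rate(codelab_api_request_duration_seconds_count[5m])))

    # 90th-percentile latency per method, using the buckets the intended way
    histogram_quantile(0.9, sum by (le, method) (rate(codelab_api_request_duration_seconds_bucket[5m])))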

Comment on lines 64 to 67
- expr: codelab_api_request_duration_seconds_bucket{method="GET"} or codelab_api_request_duration_seconds_bucket{method="POST"}
- expr: codelab_api_request_duration_seconds_sum{status="200"} or codelab_api_request_duration_seconds_sum{status="500"}
- expr: codelab_api_request_duration_seconds_bucket{status="200"} and codelab_api_request_duration_seconds_bucket{method="GET"}
- expr: codelab_api_request_duration_seconds_count{method="POST"} and codelab_api_request_duration_seconds_count{status="500"}
bboreham (Member) commented:
I don't see much point in doing multiple expressions that are essentially the same.
or is different from and, but beyond those you could use /, taking the ratio of errors to all requests, for instance.
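For instance, an error-to-total ratio along those lines (illustrative; assumes the status label seen in the hunk above carries HTTP response codes):

    # fraction of requests that returned 500
    sum(rate(codelab_api_request_duration_seconds_count{status="500"}[5m]))
      /
    sum(rate(codelab_api_request_duration_seconds_count[5m]))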

bboreham (Member) commented:
I realized a couple of things since this comment: you have / in "arithmetic operation" above, but this and is never going to return anything because the labels on each side are different. We want the benchmark queries to make sense.
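One hedged way such an and can return results when the two sides carry different label sets is to restrict the match to labels both sides share with on(...); the metrics and labels here are illustrative of the mechanism rather than a query worth benchmarking:

    # keep API request series only for targets that also expose Go GC metrics
    codelab_api_request_duration_seconds_count{status="500"}
      and on (job, instance)
    go_gc_duration_seconds_count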

kushalShukla-web and others added 3 commits October 10, 2024 14:55
Slow down arithmetic_operation and logic_operator; take out a few
queries to avoid overloading the server.

Stop querying `_bucket` series directly; those should be used by
`histogram_quantile` or similar.

Use more realistic `k` parameters to `topk`.

Signed-off-by: Bryan Boreham <[email protected]>
For balance, to retain about the same overall load on the server as
before.

Signed-off-by: Bryan Boreham <[email protected]>
@bboreham (Member)

I trimmed down the newly-added queries a bit:

  • Slowed down arithmetic_operation and logic_operator.
  • Took out a few queries to avoid overloading the server.
  • Stopped querying _bucket series directly; those should be used by histogram_quantile or similar.
  • Used more realistic k parameters for topk.

I also trimmed down some pre-existing queries for balance, to retain about the same overall load on the server as before.

@bboreham merged commit 1bba995 into prometheus:master on Oct 20, 2024
6 checks passed