[Scaling Investigation] Validate Client Simulation Accuracy #557
Set Up Experiment Prerequisites
Scaling investigation scripts were created a few weeks back. They can be found here: https://github.com/IanHoang/scaling-investigation
Experiment 1:
This is related to the scale testing RFC. For more details, see the RFC here.
To see other experiments in this analysis, see the META issue.
In this experiment we want to address the following questions:
During a test, the Worker Coordinator Actor provisions and coordinates a number of Worker Actors that are responsible for driving requests to the SUT. Each Worker Actor is allocated a number of clients that perform steps (also known as tasks or operations in a workload). It’s worth noting that the number of Worker Actors is determined by the number of CPU cores or vCPUs on the host running OSB.
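As a rough sketch of the allocation described above (this is not OSB's actual implementation, only an illustration of the idea that worker count defaults to the host's vCPU count and clients are divided among workers):

```python
import os

# Hypothetical sketch of how clients might be spread across Worker Actors.
# This is NOT OSB's actual code; it only illustrates one worker per vCPU
# with clients divided as evenly as possible among workers.
def allocate_clients(num_clients, num_workers=None):
    """Return a list with the number of clients assigned to each worker."""
    if num_workers is None:
        num_workers = os.cpu_count() or 1  # one Worker Actor per vCPU
    base, extra = divmod(num_clients, num_workers)
    # The first `extra` workers take one additional client each.
    return [base + (1 if i < extra else 0) for i in range(num_workers)]

print(allocate_clients(8, num_workers=2))  # -> [4, 4]
print(allocate_clients(5, num_workers=4))  # -> [2, 1, 1, 1]
```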
The two tables below (Autoscaling Group with OpenSearch Benchmark set to a single client on each EC2 instance, and Load Generation Host with OpenSearch Benchmark) describe two series of experiments to determine whether a single load generation host can simulate the same performance as a set of instances that each acts as a single independent client.
To reduce discrepancies, we ensure that experiments in Table 2: Load Generation Host with OpenSearch Benchmark have no more than 1 client assigned per worker actor. This can be seen in how the number of clients is always less than or equal to the number of vCPUs. This matches how each instance in the ASG in Table 1: Autoscaling Group with OpenSearch Benchmark will only ever use one vCPU (even though each has 2 vCPUs).
Table 1: Determine Performance of an Autoscaling Group of N Instances of OpenSearch Benchmark where search_clients = 1
In the table above, the gradual increase in instance count of the same instance type implies a gradual progression of search clients. Each instance will run OSB with one search client. When all instances have finished running OSB, we can use a script to aggregate the service-time results across all instances in the ASG.
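The aggregation step could look something like the following; this is a hypothetical sketch, not the actual script from the scaling-investigation repository, and the input shape (one list of service-time samples per instance) is an assumption:

```python
import statistics

# Hypothetical aggregation sketch: merge per-instance service-time samples
# (e.g. one result set per ASG instance) into overall summary statistics.
def aggregate_service_times(per_instance_samples):
    """Merge service-time samples (in ms) from every instance and summarize."""
    merged = sorted(t for samples in per_instance_samples for t in samples)
    return {
        "mean": statistics.mean(merged),
        "p50": statistics.median(merged),
        "p90": merged[int(0.9 * (len(merged) - 1))],  # nearest-rank p90
        "max": merged[-1],
    }

# Example: three instances, each contributing its own samples.
summary = aggregate_service_times([[5.2, 6.1], [5.8, 7.0], [6.4, 5.9]])
print(summary["p50"])  # -> 6.0
```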
Table 2: Determine Performance of a Single Load Generation Host with OpenSearch Benchmark where search_clients = N
In the table above, there will only be a single load generation host.
After running experiments from Table 1 & 2, we should perform a comparison.
Table 3: Load Generation Host with OpenSearch Benchmark where search_clients = N and More Clients Per Worker
Knowing that worker actors can be allocated more than one client, we should also rerun the load generation host with OpenSearch Benchmark in a way where more clients are allocated to each worker actor, as seen in Table 3: Load Generation Host with OpenSearch Benchmark and More Clients Per Worker. This will confirm whether adding more clients to a worker (running with a smaller instance type that has fewer CPU cores) can simulate the same performance as assigning one client to one worker. In round 1 in the table above, we should expect to see two workers (since there are two vCPUs) with 4 clients each. In round 2, we should see two workers with 8 clients each. We can compare these with the results from Table 2 (where we tested the same configurations but kept 1 client per worker). If we see no degradation here, scaling investigation 2 should stress the load generation host and help us determine the maximum number of clients allowed per worker.
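A trivial sketch of the "no degradation" check mentioned above; the 5% tolerance and the use of p90 service time are illustrative assumptions, not thresholds from the issue:

```python
# Hypothetical comparison for Table 2 vs. Table 3 results: flag a
# configuration as degraded when its p90 service time regresses beyond a
# tolerance. The 5% default is an illustrative assumption.
def is_degraded(baseline_p90_ms, candidate_p90_ms, tolerance=0.05):
    """True when the candidate p90 exceeds the baseline by more than tolerance."""
    return candidate_p90_ms > baseline_p90_ms * (1 + tolerance)

print(is_degraded(10.0, 10.3))  # within 5% -> False
print(is_degraded(10.0, 11.0))  # 10% slower -> True
```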
Term Query
The term query above is considered a fast query in the Big5 workload and can be used for our experiment.
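The query body itself did not survive in this capture. Purely as an illustration of the general shape of an OpenSearch term query (the field name and value here are placeholders, not the Big5 workload's actual parameters):

```python
# Placeholder term query in OpenSearch query-DSL shape, expressed as a
# Python dict. The field and value are illustrative, not from Big5.
query = {
    "query": {
        "term": {
            "process.name": {
                "value": "kernel"
            }
        }
    }
}

print(query["query"]["term"]["process.name"]["value"])  # -> kernel
```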
Metrics to Analyze
With each round of tests, we’ll compare metrics such as query throughput and service time across both the ASG clients and the load generation host. We’ll also monitor resource utilization in the ASG instances, the load generation host, and the system-under-test. If the system-under-test shows signs of resource bottlenecks, we will scale it out and rerun the tests to ensure that the results are not skewed.
Why are we not using latency?
OSB’s definition of latency differs slightly from the colloquial one. In OSB, when a user specifies a target throughput with the target-throughput parameter, latency is the service time plus the time the request spends waiting in the queue. When target-throughput is not set, service time and latency are equivalent. The parameter exists for users who want to achieve a specific throughput, for example to simulate the throughput seen in their production clusters. For these experiments, we will not set target-throughput, so the clients (in the ASG and in OSB) will send queries as fast as possible. We will therefore focus primarily on service time, which should be equivalent to latency here. For more information, see this article from OSB’s documentation.
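The relationship described above can be stated as a one-line formula; this is a paraphrase of OSB's documented behavior, not OSB code:

```python
# latency = queue_wait + service_time when target-throughput is set;
# with no target-throughput, queue_wait is effectively zero, so the
# two metrics coincide.
def latency_ms(service_time_ms, queue_wait_ms=0.0):
    return queue_wait_ms + service_time_ms

print(latency_ms(12.5))                      # no cap -> 12.5
print(latency_ms(12.5, queue_wait_ms=4.0))   # queued -> 16.5
```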