[Scaling Investigation] Validate Client Simulation Accuracy #557

Open · IanHoang opened this issue Jun 20, 2024 · 2 comments

Labels: Child Issue, enhancement (New feature or request)
IanHoang commented Jun 20, 2024

Experiment 1:

This is related to the scale testing RFC. For more details, see the RFC here.

To see other experiments in this analysis, see the META issue.

In this experiment we want to address the following questions:

  • Do search clients in OSB properly simulate actual clients in a client-server model?
  • For situations where workers have more than one search client, does OSB still properly simulate clients in a client-server model?

During a test, the Worker Coordinator Actor provisions and coordinates a number of Worker Actors that are responsible for driving requests to the system under test (SUT). Each Worker Actor is allocated a number of clients that perform steps (also known as tasks or operations in a workload). It’s worth mentioning that the number of Worker Actors is determined by the number of CPU cores (vCPUs) of the host running OSB.
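
To make the allocation model concrete, here is a minimal sketch of the behavior described above. This is an illustration only, not OSB’s actual implementation; the function name is hypothetical.

    import os

    def allocate_clients(num_clients, num_workers=None):
        """Distribute client IDs across Worker Actors round-robin.

        Illustrates the described behavior: one Worker Actor per CPU core,
        each worker driving one or more simulated clients.
        """
        if num_workers is None:
            num_workers = os.cpu_count()  # worker pool sized from the host's cores
        assignments = [[] for _ in range(num_workers)]
        for client_id in range(num_clients):
            assignments[client_id % num_workers].append(client_id)
        return assignments

    # On a 2-vCPU host with 8 search clients: two workers with 4 clients each.
    print(allocate_clients(num_clients=8, num_workers=2))
    # [[0, 2, 4, 6], [1, 3, 5, 7]]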

The two tables listed below (Table 1: Autoscaling Group with OpenSearch Benchmark, where each EC2 instance runs OpenSearch Benchmark with a single client, and Table 2: Load Generation Host with OpenSearch Benchmark) describe two series of experiments to determine whether a single load generation host can simulate the same performance as a set of instances that each act as a single independent client.

To reduce discrepancies, we ensure that the experiments in Table 2: Load Generation Host with OpenSearch Benchmark have no more than one client assigned per worker actor. This is reflected in how the number of clients is always equal to or less than the number of vCPUs. It matches how each instance in the ASG in Table 1: Autoscaling Group with OpenSearch Benchmark will always use only one vCPU (even though each has two).

Table 1: Determine Performance of an Autoscaling Group of N instances of OpenSearch Benchmark where search_clients = 1

| Autoscaling Group with OpenSearch Benchmark | Clients | Instance Type | Instance Count | vCPUs | Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Round 1 | 8 | c5.large | 8 | 16 | 32 |
| Round 2 | 16 | c5.large | 16 | 32 | 64 |
| Round 3 | 32 | c5.large | 32 | 64 | 128 |

In the table above, the gradual increase in instance count (with the same instance type) corresponds to a gradual progression of search clients, since each instance runs OSB with one search client. When all the instances have finished running OSB, we can use a script to aggregate the service-time results across all instances in the ASG.
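
As a sketch of what that aggregation could look like, assuming each instance exports its raw service-time samples (in milliseconds) to a JSON file; the file layout and field name here are hypothetical:

    import glob
    import json
    import statistics

    def aggregate_service_times(pattern="results/instance-*.json"):
        """Pool raw service-time samples from every ASG instance and compute
        percentiles over the combined distribution. Percentiles must be taken
        over the pooled raw samples; averaging each instance's p90/p99 would
        understate the tail.
        """
        samples = []
        for path in glob.glob(pattern):
            with open(path) as f:
                samples.extend(json.load(f)["service_time_ms"])  # hypothetical field
        cuts = statistics.quantiles(samples, n=100)
        return {
            "count": len(samples),
            "mean": statistics.fmean(samples),
            "p50": cuts[49],
            "p90": cuts[89],
            "p99": cuts[98],
        }

    print(aggregate_service_times())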

Table 2: Determine Performance of a Single Load Generation Host with OpenSearch Benchmark where search_clients = N

| LG Hosts with OpenSearch Benchmark | Simulated Clients (search_clients: N) | Instance Type | Instance Count | vCPUs | Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Round 1 | 8 | c5.2xlarge | 1 | 8 | 16 |
| Round 2 | 16 | c5.4xlarge | 1 | 16 | 32 |
| Round 3 | 32 | c5.9xlarge | 1 | 36 | 72 |

In the table above, there will only be a single load generation host.

After running the experiments from Tables 1 and 2, we should compare the results.

Table 3: Load Generation Host with OpenSearch Benchmark where search_clients = N & More Clients Per Worker

| LG Hosts with OpenSearch Benchmark | Simulated Clients (search_clients: N) | Instance Type | Instance Count | vCPUs | Memory (GB) | Clients Per Worker Actor |
| --- | --- | --- | --- | --- | --- | --- |
| Round 1 | 8 | c5.large | 1 | 2 | 4 | 4 |
| Round 2 | 16 | c5.large | 1 | 2 | 4 | 8 |
| Round 3 | 32 | c5.large | 1 | 2 | 4 | 16 |

Since worker actors can be allocated more than one client, we should also rerun the load generation host with OpenSearch Benchmark in a configuration where more clients are allocated to each worker actor, as seen in Table 3: Load Generation Host with OpenSearch Benchmark and More Clients Per Worker. This will confirm whether adding more clients to a worker (running on a smaller instance type with fewer CPU cores) can simulate the same performance as assigning one client per worker. In Round 1 above, we should expect to see two workers (since there are two vCPUs) with 4 clients each; in Round 2, two workers with 8 clients each; and in Round 3, two workers with 16 clients each (see the sketch below). We can compare these with the results from Table 2, where we tested the same configurations but kept one client per worker. If we see no degradation here, scaling investigation 2 should stress the load generation host and help us determine the maximum number of clients allowed per worker.
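
The expected worker/client split for each round can be sanity-checked with a few lines (assuming one Worker Actor per vCPU, as described above):

    # Table 3 rounds: (search_clients, vCPUs on the c5.large load generation host)
    for round_no, (clients, vcpus) in enumerate([(8, 2), (16, 2), (32, 2)], start=1):
        workers = vcpus  # one Worker Actor per vCPU
        print(f"Round {round_no}: {workers} workers x {clients // workers} clients each")
    # Round 1: 2 workers x 4 clients each
    # Round 2: 2 workers x 8 clients each
    # Round 3: 2 workers x 16 clients each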

Term Query

    {
        "name": "term",
        "operation-type": "search",
        "index": "{{index_name | default('big5')}}",
        "request-timeout": 7200,
        "body": {
            "query": {
                "term": {
                    "log.file.path": {
                        "value": "/var/log/messages/fuschiashoulder"
                    }
                }
            }
        }
    },

The term query above is considered a fast query in the Big5 workload and can be used for our experiment.
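
For a rough client-side cross-check outside of OSB, the same query can be sent with the opensearch-py client and timed directly. This is a sketch; the endpoint, security settings, and index name are assumptions.

    import time

    from opensearchpy import OpenSearch

    # Hypothetical endpoint for the system under test.
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    query = {
        "query": {
            "term": {
                "log.file.path": {"value": "/var/log/messages/fuschiashoulder"}
            }
        }
    }

    start = time.perf_counter()
    response = client.search(index="big5", body=query)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # "took" is the server-side time in ms; the perf_counter delta approximates
    # the client-observed service time (including network transfer).
    print(f"hits={response['hits']['total']['value']} took={response['took']}ms "
          f"service_time={elapsed_ms:.1f}ms")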

Metrics to Analyze

With each round of tests, we’ll compare metrics such as query throughput and service time between the clients in the ASG and the load generation host. We’ll also monitor resource utilization on the ASG instances, the load generation host, and the system under test. If the system under test shows signs of resource bottlenecks, we will scale it out and rerun the tests to ensure that the results are not skewed.

Why are we not using latency?

OSB’s definition of latency is slightly different from the colloquial one. In OSB, when a user specifies a target throughput with the target-throughput parameter, latency is the service time plus the time the request spends waiting in the queue. When target-throughput is not set, service time and latency are equivalent. The parameter exists for users who want to achieve a specific throughput, for example to simulate the throughput seen in their production clusters. For these experiments we will not set target-throughput, so the clients (in the ASG and on the load generation host) will send queries as fast as possible; we will therefore focus primarily on service time, which should be equivalent to latency. For more information, see this article from OSB’s documentation.
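
The distinction can be made concrete with a toy schedule (a sketch of the semantics described above, not OSB’s actual scheduler):

    # Toy model: requests are scheduled at 2 req/s (one every 0.5 s), but each
    # takes 0.8 s of service time, so a queue builds and latency grows while
    # service time stays flat.
    target_interval = 0.5  # seconds between scheduled requests (2 req/s)
    service_time = 0.8     # seconds the SUT takes per request

    free_at = 0.0  # time at which the single client is next free
    for i in range(5):
        scheduled = i * target_interval
        start = max(scheduled, free_at)   # request may wait in the queue first
        finish = start + service_time
        latency = finish - scheduled      # OSB latency = queue wait + service time
        free_at = finish
        print(f"req {i}: service_time={service_time:.1f}s latency={latency:.2f}s")
    # Without a target throughput, start == scheduled for every request, and
    # latency collapses to service time.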


IanHoang commented Jun 24, 2024

Set Up Experiment Prerequisites

  • Set up large OpenSearch cluster: 20 Data Nodes (r5.large), 3 Master Nodes (c5.2xlarge)
  • Set up Auto Scaling Group with OSB (Table 1)
    • Set up AMI with OSB and Big5 installed
    • Set up launch template (ensure that commands have tags to denote that these were run from the Auto Scaling Group).
    • Test out with test cluster
  • Set up single load generation host (Table 2 and 3)
  • Set up metric data store (MDS)
  • Create script that aggregates results from MDS and produces a summary of performance across all instances (or "clients") in the Auto Scaling Group (a sketch follows below)
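
A sketch of that aggregation step, assuming the MDS is an OpenSearch cluster holding OSB metric documents with name/value fields; the index pattern and the meta tag used to mark ASG runs are assumptions:

    from opensearchpy import OpenSearch

    # Hypothetical MDS endpoint.
    mds = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    # Service-time percentiles across every instance in the ASG, filtered by a
    # hypothetical meta tag set via the launch template.
    body = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name": "service_time"}},
                    {"term": {"meta.source": "asg"}},  # hypothetical tag
                ]
            }
        },
        "aggs": {
            "service_time_pcts": {
                "percentiles": {"field": "value", "percents": [50, 90, 99]}
            }
        },
    }

    result = mds.search(index="benchmark-metrics-*", body=body)
    print(result["aggregations"]["service_time_pcts"]["values"])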

@IanHoang

Scaling investigation scripts were created a few weeks back. They can be found here: https://github.com/IanHoang/scaling-investigation
