
Add Benchmark concepts of service time and latency #5916

Merged 11 commits on Dec 22, 2023
89 changes: 88 additions & 1 deletion _benchmark/user-guide/concepts.md
@@ -7,7 +7,7 @@ parent: User guide

# Concepts

Before using OpenSearch Benchmark, familiarize yourself with the following concepts.
Before using OpenSearch Benchmark (OSB), familiarize yourself with the following concepts.

## Core concepts and definitions

@@ -25,6 +25,93 @@ A workload is a specification of one or more benchmarking scenarios. A workload
- One or more data streams that are ingested into indexes.
- A set of queries and operations that are invoked as part of the benchmark.

## Throughput and latency

At the end of each test, OSB produces a table that summarizes the following:

- [Service time](#service-time)
- Throughput
- [Latency](#latency)
- The error rate for each completed task or OpenSearch operation
While the definition of _throughput_ remains consistent with other client-server systems, in the context of OSB the definitions of _service time_ and _latency_ differ from those used by most client-server systems. The following table compares the OSB definitions of service time and latency with the common definitions for a client-server system.

| Metric | Common definition | **OSB Definition** |
| :--- | :--- |:--- |
| **Throughput** | The number of operations completed in a given period of time. | The number of operations completed in a given period of time. |
| **Service time** | The time the server takes to process a request, from the point it receives the request to the point the response is returned. </br></br> It includes the time spent waiting in server-side queues, but _excludes_ network latency, load-balancer overhead, and deserialization/serialization. | The time it takes for `opensearch-py` to send a request and receive a response from the OpenSearch cluster. </br> </br> It includes the time it takes for the server to process a request, and also _includes_ network latency, load-balancer overhead, and deserialization/serialization. |
| **Latency** | The total time taken to complete a request, including the service time and the time the request spent waiting before being serviced. | Based on the `target-throughput` set by the user, the total time that the request waited before receiving the response, in addition to any other delays that occurred before the request was sent. |
For more information on service time and latency in OSB, see the [Service time](#service-time) and [Latency](#latency) sections.


### Service time

OSB does not have insight into how long OpenSearch takes to process a request, apart from extracting the `took` time from the response. In OSB, **service time** tracks the duration between the point at which OSB issues a request and the point at which it receives the response.

OSB makes function calls to `opensearch-py` to communicate with an OpenSearch cluster. OSB tracks the time between when the `opensearch-py` client sends a request and when it receives a response from the OpenSearch cluster and considers this to be the service time. Unlike the traditional definition of service time, OSB's definition includes overhead such as network latency, load-balancer overhead, and deserialization/serialization. The following image highlights the differences between the traditional definition of service time and OSB's definition.

<img src="{{site.url}}{{site.baseurl}}/images/benchmark/service-time.png" alt="">
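One way to picture OSB's service time is as a stopwatch wrapped around the entire client call. The following sketch is illustrative only, not OSB source code, and the stand-in response is hypothetical; it shows why a measurement taken at this point necessarily includes client-side and network overhead, not just server processing:

```python
import time

def timed_call(request_fn):
    # Wall-clock time around the whole client call. A measurement taken
    # here includes network latency, load-balancer overhead, and
    # (de)serialization, not just server-side processing time.
    start = time.perf_counter()
    response = request_fn()
    elapsed = time.perf_counter() - start
    return response, elapsed

# Hypothetical stand-in for an `opensearch-py` call such as client.search():
response, service_time = timed_call(lambda: {"took": 5})
```

The `took` field in the real OpenSearch response reports only the server-side processing time, which is why it is always smaller than the time measured around the call.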

### Latency

Target throughput is central to understanding OSB's definition of **latency**. Target throughput is the rate at which OSB issues requests, assuming that responses will be returned instantaneously. `target-throughput` is a common workload parameter that can be set for each test and is measured in operations per second.
OSB always issues one request at a time per client thread, as specified by `search-clients` in the workload parameters. If `target-throughput` is set to `0`, OSB issues the next request immediately after it receives the response to the previous request. If `target-throughput` is not `0`, OSB issues requests on a schedule that matches the specified `target-throughput`, assuming that responses are returned instantaneously.
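This scheduling behavior can be sketched as a small model. The following is illustrative only, not OSB source code, and the function and parameter names are invented:

```python
def request_send_times(service_times, target_throughput):
    # Actual send time of each request for one client (times in seconds).
    # target_throughput is in operations per second; 0 means "send the
    # next request as soon as the previous response arrives".
    interval = 1.0 / target_throughput if target_throughput else 0.0
    send_times = []
    free_at = 0.0  # when the client finishes waiting on the previous response
    for i, service_time in enumerate(service_times):
        scheduled = i * interval        # ideal send time from the schedule
        send = max(scheduled, free_at)  # a late response delays the next send
        send_times.append(send)
        free_at = send + service_time
    return send_times

# One client at 1 op/s: a single 1.1 s response pushes the request
# scheduled at 4.0 s out to roughly 4.1 s.
print(request_send_times([0.2, 0.2, 0.2, 1.1, 0.2], 1.0))
```

The key design point is `max(scheduled, free_at)`: the schedule is fixed in advance, so a slow response can only delay subsequent sends, never reschedule them.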

#### Example A

The following diagrams illustrate how latency is calculated when the expected request response time is 200ms and the following settings are used:

- `search-clients` is set to `1`
- `target-throughput` is set to `1` operation per second

<img src="{{site.url}}{{site.baseurl}}/images/benchmark/latency-explanation-1.png" alt="">

When a request takes longer than 200ms, such as when a request takes 1100ms instead of the expected 200ms, OSB sends the next request, which was supposed to occur at 4.00s based on the `target-throughput`, at 4.10s. All subsequent requests attempt to resynchronize with the `target-throughput` setting.

<img src="{{site.url}}{{site.baseurl}}/images/benchmark/latency-explanation-2.png" alt="">

When measuring the overall latency, OSB looks at all of the requests performed. All requests have a latency of 200ms except for the following two requests:

- The request that lasted 1100ms.
- The subsequent request, which was supposed to start at 4.00s. This request was delayed by 100ms, denoted by the orange area in the following diagram, and had a response time of 200ms. When calculating the latency for this request, OSB accounts for the delayed start time and combines it with the response time. Thus, the latency for this request is **300ms**.

<img src="{{site.url}}{{site.baseurl}}/images/benchmark/latency-explanation-3.png" alt="">
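Example A's latency figure can be checked with back-of-the-envelope arithmetic. The numbers below are taken from the example; the variable names are ours:

```python
# Numbers from Example A (1 op/s schedule, 200 ms expected responses):
slow_request_ms = 1100    # the request that overran its slot
schedule_slot_ms = 1000   # one request per second at 1 op/s
response_ms = 200         # response time of the delayed request

# The slow request started on schedule, so the next send slips by
# however much the slow request overran its 1 s slot.
delay_ms = slow_request_ms - schedule_slot_ms   # 100 ms
latency_ms = delay_ms + response_ms             # delayed start + response time
print(latency_ms)  # -> 300
```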

#### Example B

In this example, OSB expects each response to take 200ms and uses the following settings:

- `search-clients` set to `1`
- `target-throughput` set to `10` operations per second

The following diagram shows the schedule built by OSB using the expected response times.
<img src="{{site.url}}{{site.baseurl}}/images/benchmark/b-latency-explanation-1.png" alt="">

However, if each response takes 200ms, a rate of 10 operations per second isn't achievable with a single client. The highest throughput OSB can reach is 5 operations per second, as shown in the following diagram.

<img src="{{site.url}}{{site.baseurl}}/images/benchmark/b-latency-explanation-2.png" alt="">

OSB does not account for this and continues trying to achieve the `target-throughput` of 10 operations per second. Because of this, the delays for each request begin to cascade, as illustrated in the following diagram.

<img src="{{site.url}}{{site.baseurl}}/images/benchmark/b-latency-explanation-3.png" alt="">

If we combine the service time with the delay for each operation, we’ll get the following latency measurements for each operation:

- 200ms for operation 1
- 300ms for operation 2
- 400ms for operation 3
- 500ms for operation 4
- 600ms for operation 5

This latency cascade continues, increasing latency by 100ms for each subsequent request.
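Example B's cascade can be reproduced with a small simulation. This is illustrative only, not OSB source code; the variable names are ours:

```python
# Single client, 200 ms service time, 10 op/s schedule (0.1 s interval).
interval = 0.1
service = 0.2

free_at = 0.0     # when the client finishes waiting on the previous response
latencies = []
for i in range(5):
    scheduled = i * interval          # when the schedule wants to send
    send = max(scheduled, free_at)    # the client is still busy, so it slips
    latencies.append(round(send - scheduled + service, 1))
    free_at = send + service

print(latencies)  # -> [0.2, 0.3, 0.4, 0.5, 0.6]
```

Each request slips 100ms further behind the schedule than the one before it, which is exactly the cascade shown in the diagrams.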

### Recommendation

As shown in the preceding examples, be aware of the average service time of each task and set a `target-throughput` that accounts for it. OSB latency is calculated based on the `target-throughput` set by the user; in other words, OSB latency could be redefined as "throughput-based latency".
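As a rough rule of thumb (our assumption, not an official OSB formula), the throughput a benchmark can sustain is bounded by the number of clients divided by the average service time, so a `target-throughput` above that bound will produce cascading latency:

```python
def max_sustainable_throughput(avg_service_time_s, search_clients=1):
    # Each client completes at most 1 / avg_service_time requests per
    # second, so the client pool is bounded by this aggregate rate.
    return search_clients / avg_service_time_s

# With Example B's 200 ms service time and one client:
print(max_sustainable_throughput(0.2))  # -> 5.0
```

Setting `target-throughput` at or below this estimate keeps the reported latency close to the service time.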

## Anatomy of a workload

The following example workload shows all of the essential elements needed to create a `workload.json` file. You can run this workload in your own benchmark configuration to understand how all of the elements work together:
Binary file added images/benchmark/b-latency-explanation-1.png
Binary file added images/benchmark/b-latency-explanation-2.png
Binary file added images/benchmark/b-latency-explanation-3.png
Binary file added images/benchmark/latency-explanation-1.png
Binary file added images/benchmark/latency-explanation-2.png
Binary file added images/benchmark/latency-explanation-3.png
Binary file added images/benchmark/service-time.png