
[RFC]: Introducing Aggregation and Enhanced Comparison for OSB #627

Open
OVI3D0 opened this issue Aug 27, 2024 · 1 comment
Labels: RFC Request for comment on major changes

OVI3D0 commented Aug 27, 2024

Synopsis

OpenSearch Benchmark (OSB) is a performance testing tool for OpenSearch, a community-driven, open source search and analytics suite. It allows users to benchmark various aspects of OpenSearch, such as indexing and querying, under different configurations and workloads. The Compare API is a feature in OSB that allows users to analyze and compare the performance differences between two benchmark test executions. While valuable, the current implementation has certain limitations. This RFC proposes enhancements to the Compare API that will improve how OSB analyzes and presents benchmark results, making OSB a more versatile tool for users in the OpenSearch community.

Motivation

Upon executing a test, OSB assigns a unique ID to each test execution result. The current implementation of the Compare API allows users to compare and analyze the results of two benchmark test executions by providing the ID of one test execution to serve as the baseline and the ID of a contender to compare against it. Users can obtain these test execution IDs with the opensearch-benchmark list test-executions command.

The following is an example of how the Compare API is invoked, along with its output.

$ opensearch-benchmark compare --baseline=729291a0-ee87-44e5-9b75-cc6d50c89702 --contender=a33845cc-c2e5-4488-a2db-b0670741ff9b
   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

Comparing baseline
  TestExecution ID: 729291a0-ee87-44e5-9b75-cc6d50c89702
  TestExecution timestamp: 2023-05-24 18:17:18 

with contender
  TestExecution ID: a33845cc-c2e5-4488-a2db-b0670741ff9b
  TestExecution timestamp: 2023-05-23 21:31:45


------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
                                                  Metric    Baseline    Contender               Diff
--------------------------------------------------------  ----------  -----------  -----------------
                        Min Indexing Throughput [docs/s]       19501        19118  -383.00000
                     Median Indexing Throughput [docs/s]       20232      19927.5  -304.45833
                        Max Indexing Throughput [docs/s]       21172        20849  -323.00000
...
               Query latency term (50.0 percentile) [ms]     2.10049      2.15421    +0.05372
               Query latency term (90.0 percentile) [ms]     2.77537      2.84168    +0.06630
              Query latency term (100.0 percentile) [ms]     4.52081      5.15368    +0.63287

The comparison output lists each metric for the baseline and the contender along with the difference between them. This is particularly useful when evaluating performance across test runs, OpenSearch versions, and configurations. The Compare API also offers additional command-line options, such as including specific percentiles in the comparison, exporting the comparison to different output formats, and appending the comparison to the results file.

However, the Compare API has limitations.

  • The API only supports comparing results from two tests at a time, but users have expressed interest in comparing aggregated results across multiple runs of the same test. Users run the same test multiple times to reduce random error and ensure their results are consistent. Therefore, the abilities to aggregate results across multiple test runs and to compare these aggregated results are essential to the performance testing experience.
  • Additionally, the Compare API is limited to two output formats, Markdown and CSV, which does not provide users much flexibility, especially if they are using other tools to analyze results.

In performance testing, it is common practice to run the same test multiple times to account for variability and ensure more consistent results. This variability can arise from various factors in the test environment as well as from random fluctuations between runs. By aggregating the results, users can obtain a more reliable and representative measure of performance, reducing the impact of outliers or random variations.

Requirements

To address the limitations of the Compare API and to enhance the overall data processing experience in OSB, the following capabilities should be added.

  • Ability to aggregate results across multiple test executions
  • Ability to compare aggregated results from two or more test executions

The following section proposes solutions that can fulfill these requirements.

Proposed Solutions:

  • Ability to aggregate results across multiple test executions: Introduce a new aggregate subcommand that accepts a list of test execution IDs and generates an aggregated result. For each metric, OSB will compute weighted averages or medians across the specified test runs. For metrics involving percentiles (e.g., query latencies), the aggregation will compute the percentile values from the combined, weighted distribution of all data points in the individual test runs. Each test run's contribution will be weighted proportionally to the number of iterations or data points in that run. A validation step will ensure that the underlying workload configuration is consistent across all the test executions being aggregated. The aggregated result will be assigned a new ID for future reference and stored in a separate folder to keep the file system uncluttered.

For example, if we have three test executions with the following median indexing throughput values and iteration counts:

- Test Execution 1: Median Indexing Throughput = 20,000 docs/s, Iterations = 1,000
- Test Execution 2: Median Indexing Throughput = 18,000 docs/s, Iterations = 2,000
- Test Execution 3: Median Indexing Throughput = 22,000 docs/s, Iterations = 1,500

The weighted average for median indexing throughput would be calculated as follows:

Weighted Sum = (20,000 * 1,000) + (18,000 * 2,000) + (22,000 * 1,500)
            = 20,000,000 + 36,000,000 + 33,000,000
            = 89,000,000

Total Iterations = 1,000 + 2,000 + 1,500 = 4,500

Weighted Average Median Indexing Throughput = Weighted Sum / Total Iterations
                                            = 89,000,000 / 4,500
                                            = 19,777.78 docs/s
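
As an illustration only, the same weighted-average calculation expressed as a short Python sketch (the helper below is hypothetical and not part of OSB):

# Illustrative sketch, not OSB code: weight each run's metric by its iteration count.
def weighted_average(values, weights):
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / sum(weights)

# Median indexing throughput (docs/s) and iteration counts from the example above.
throughputs = [20_000, 18_000, 22_000]
iterations = [1_000, 2_000, 1_500]

print(round(weighted_average(throughputs, iterations), 2))  # 19777.78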

Example usage:

opensearch-benchmark aggregate --test-executions=<test_execution_id1>,<test_execution_id2>,...
  • Ability to compare aggregated results from two or more test executions: The existing compare subcommand will be enhanced to support comparing aggregated results from two or more groups of test executions. A validation step will ensure that the underlying workload configuration (the type of operations being performed, the data set being used, the workload, etc.) is consistent across all the test executions involved.
  • Incorporate automatic aggregation into new and existing features: Leverage the aggregate feature in other parts of OSB, such as automatically running a test multiple times and aggregating the results, or aggregating results from distributed workload generation (DWG) tests across multiple load generation hosts.
  • Automatic Test Iterations and Aggregation: Enhance the execute command to support running multiple iterations of a test and automatically aggregating the results. New flags include:
    • --test-iterations: Specify the number of test iterations
    • --aggregate: Control result aggregation
    • --sleep-timer: Set a sleep timer between iterations
    • --cancel-on-error: Choose whether to cancel the remaining iterations when an error occurs
  • Additional Statistical Metrics: Include the following additional metrics in the aggregated results and comparisons (a short sketch of these statistics follows this list):
    • Relative Standard Deviation (RSD)
    • Median
    • Minimum
    • Maximum
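
For illustration, a minimal Python sketch of these statistics computed for a single metric across several runs (the sample values are hypothetical, and this is not OSB code):

# Illustrative sketch, not OSB code: the proposed statistics for one metric.
import statistics

samples = [19777.8, 20102.3, 19654.1, 19990.6]  # hypothetical per-run medians (docs/s)

mean = statistics.mean(samples)
rsd = statistics.stdev(samples) / mean * 100  # relative standard deviation, in percent
print(f"RSD: {rsd:.2f}%")
print(f"Median: {statistics.median(samples)}")
print(f"Min: {min(samples)}, Max: {max(samples)}")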

Subsequent issues will be created to address these requirements further and elaborate on implementation details.

Stakeholders

  • OpenSearch Benchmark tool users
  • OpenSearch developers and contributors
  • Product development teams working on OpenSearch or related products
  • Marketing teams responsible for promoting OpenSearch and its performance
  • Managed services providers offering OpenSearch
  • Corporate benchmarking teams evaluating OpenSearch
  • Performance engineering teams tracking OpenSearch performance

Use Cases

  • As an OpenSearch Benchmark tool user, I want to be able to aggregate the results of multiple test runs to reduce variability and ensure consistent performance.
  • As an OpenSearch developer or contributor, I want to analyze the performance impact of changes made to OpenSearch by comparing the aggregated results of test executions before and after the changes.
  • As a performance engineering team or managed service provider, I want to track the performance of OpenSearch releases over time by aggregating and comparing the results of benchmark tests across multiple test runs.
  • As a benchmarking team evaluating OpenSearch, I want to benchmark my specific use-cases and workloads by aggregating and comparing the results of multiple test executions tailored to my configurations.
  • As an OpenSearch Benchmark user, I would like to export comparison data to other analytics tools like Tableau or Amazon QuickSight for deeper analysis and visualization of the performance metrics.

How Can You Help?

  • Any general comments about the overall direction are welcome.
  • Indicating whether the areas identified above for introducing an aggregate command and enhancing the compare command cover your scenarios and use cases will help in prioritizing them.
  • Provide early feedback by testing the new features as soon as they become available.
  • Help out on the implementation! Check out the issues page for work that is ready to be picked up.

Open Questions

  1. Are there any other output formats that would be useful besides Markdown, CSV, and JSON?
  2. Are there any other statistical metrics that would be valuable to include in the aggregated results?
  3. How should we handle potential inconsistencies in workload configurations when aggregating results from multiple test executions?

Next Steps

We will incorporate feedback and add more details on design, implementation and prototypes as they become available.

IanHoang (Collaborator) commented:

This will be a great addition to OpenSearch Benchmark, as it addresses several pain points that users have had for years. It will also diversify OSB's capabilities and open up new development opportunities.

To add to the second proposed item: when validating whether the comparison can be performed, the compare feature should also determine whether the two test executions' test procedures (or scenarios) differ. Some things to consider:

  • If the test procedures are different, we might want to do what @gaiksaya suggested and issue a warning to the console stating that OSB is about to compare two different test procedures. It would also help to ask the user whether they want to proceed, since we might not always want to automatically abort or continue; letting the user decide is preferable. A warning alone can also get buried by the comparison output, so a prompt would make users acknowledge the difference before continuing. For example:
The test executions <baseline-id> and <contender-id> have the same workload but different test procedures.
Would you still like OSB to compare them? [y/n]:
  • If the user decides to continue and compare two different test procedures of the same workload, we might want to compare only the tasks / operations that the two tests have in common. The output could also include a statement noting that the test IDs have different test procedures and that only overlapping steps were compared (see the sketch below).
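
A minimal sketch of that idea, assuming each test execution exposes a simple list of task names (the names below are hypothetical, and this is not OSB's actual data model):

# Illustrative sketch, not OSB code: compare only tasks present in both test procedures.
baseline_tasks = ["index-append", "term", "range", "scroll"]       # hypothetical
contender_tasks = ["index-append", "term", "match-all", "scroll"]  # hypothetical

overlapping = [t for t in baseline_tasks if t in contender_tasks]
skipped = sorted(set(baseline_tasks) ^ set(contender_tasks))

print("Comparing only overlapping tasks:", overlapping)
print("Skipped (not present in both test procedures):", skipped)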

Overall, great RFC and am excited to see what comes out of this!

@OVI3D0 OVI3D0 changed the title [RFC]: Introducing Aggregation, Enhanced Comparison, and JSON Export for OSB [RFC]: Introducing Aggregation and Enhanced Comparison for OSB Oct 17, 2024