Adding blog post for performance improvements in hybrid query in 2.15 #3020

Merged
22 changes: 22 additions & 0 deletions _community_members/varunudr.md
@@ -0,0 +1,22 @@
---
name: Varun Jain
short_name: varunudr
title: 'OpenSearch Community Member: Varun Jain'
primary_title: Varun Jain
breadcrumbs:
icon: community
items:
- title: Community
url: /community/index.html
- title: Members
url: /community/members/index.html
- title: "Varun Jain's Profile"
url: '/community/members/varunudr.html'
photo: '/assets/media/community/members/varunudr.jpg'
job_title_and_company: 'Software development engineer at AWS'
personas:
- author
permalink: '/community/members/varunudr.html'
---

**Varun Jain** is a software engineer at AWS who works on OpenSearch and its plugins.
195 changes: 195 additions & 0 deletions _posts/2014-06-26-performance-improvment-hybrid-query-215.md
@@ -0,0 +1,195 @@
---
layout: post
title: "Boosting Hybrid Query Performance in OpenSearch 2.15"
authors:
- gaievski
- vamshin
- macrakis
date: 2024-07-02
categories:
- technical-posts
meta_keywords: Multimodal search, searching with semantic and visual understanding, improve search relevance, hybrid search in OpenSearch 2.10, keyword and image search
meta_description: Improve hybrid query performance in OpenSearch 2.15.
has_science_table: true
---

<style>
.green-clr {
background-color: #c1f0c1;
}

.light-green-clr {
background-color: #e3f8e3;
}

.lightest-green-clr {
background-color: #eefbee;
}

.bold {
font-weight: 700;
}

.left {
text-align: left;
}

.center {
text-align: center;
}


table {
font-size: 16px;
}

h3 {
font-size: 22px;
}

th {
background-color: #f5f7f7;
}

</style>

Since its introduction in OpenSearch 2.10, hybrid search has become popular among customers who want to improve the relevance of their semantic search results. Hybrid queries combine full-text search and semantic search to provide better results than either method alone for a wide variety of applications in e-commerce, document search, log analytics, and data exploration. However, the complexity of hybrid queries can lead to performance issues, especially with large datasets or complex query structures.

With each new release, OpenSearch has implemented numerous enhancements to improve the performance of hybrid search at scale. In version 2.15, these enhancements improve hybrid query performance by up to 70% compared to version 2.13.

These improvements were achieved by analyzing and optimizing hot spots in the code. The development team focused on the following areas:

- **Conditional scoring logic:**
Previously, the core logic for collecting scores during a query was fixed, so every computation ran whether or not it was needed. This often led to unnecessary work, especially when certain scoring computations were redundant for specific plugins or query types.

To address this inefficiency, we made the scoring logic conditional. Now, certain compute operations can be skipped if they are not required by the plugin in use. This optimization reduces computational overhead and accelerates query processing, leading to faster search results and better resource utilization.

The performance improvements from this change are substantial. According to our benchmarks, query processing speed has increased by up to 20% for certain use cases. This enhancement significantly boosts the efficiency of handling complex queries, making OpenSearch more robust and responsive. Details can be found in the following GitHub issues:
- [[Feature Request] Provide capability for not adding top docs collector in the query search path #13170](https://github.com/opensearch-project/OpenSearch/issues/13170)
- [Pass empty QueryCollectorContext in case of hybrid query to improve latencies by 20% #731](https://github.com/opensearch-project/neural-search/pull/731)
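
As a rough illustration of the conditional approach, the following Java sketch builds the collectors for a search and adds a default top-docs collector only when the caller asks for one. All names here (`ScoreCollector`, `CountingCollector`, `buildCollectors`, `needsDefaultTopDocs`) are hypothetical and chosen for this example; they are not the OpenSearch or neural-search API.

```java
import java.util.ArrayList;
import java.util.List;

class ConditionalCollectorSketch {

    /** Hypothetical stand-in for a component that receives per-document scores. */
    interface ScoreCollector {
        void collect(int docId, float score);
    }

    /** Counts the hits it sees; stands in for a default top-docs collector. */
    static final class CountingCollector implements ScoreCollector {
        int collected = 0;

        @Override
        public void collect(int docId, float score) {
            collected++;
        }
    }

    /**
     * Builds the collectors for one search. The default collector is added only
     * when needsDefaultTopDocs is true, so a query type that gathers its own
     * scores can skip that work entirely.
     */
    static List<ScoreCollector> buildCollectors(ScoreCollector queryCollector, boolean needsDefaultTopDocs) {
        List<ScoreCollector> collectors = new ArrayList<>();
        collectors.add(queryCollector);
        if (needsDefaultTopDocs) {
            collectors.add(new CountingCollector());
        }
        return collectors;
    }

    public static void main(String[] args) {
        CountingCollector hybridCollector = new CountingCollector();
        List<ScoreCollector> collectors = buildCollectors(hybridCollector, false);
        for (ScoreCollector collector : collectors) {
            collector.collect(42, 0.9f);
        }
        System.out.println("collectors used: " + collectors.size());
    }
}
```

The design point is that the decision is made once, when the search is set up, rather than on every collected document.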

- **Replacing inefficient constructs:**
In analyzing the performance of 2.13, we found that the Java Streams API, while convenient and expressive, introduced unnecessary overhead in certain high-performance scenarios. This was particularly evident in areas with intensive data processing requirements.

To address this, 2.15 replaces Java Streams constructs with more efficient alternatives, such as for-loops and optimized data handling techniques. These optimizations resulted in a performance gain of up to 25% in specific data processing tasks. This improvement helps OpenSearch handle larger datasets and more complex queries more efficiently. For more details on these changes, refer to the GitHub issue:
- [In hybrid query optimize the way we iterate over results and collect scores of sub queries #745](https://github.com/opensearch-project/neural-search/issues/745)
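
To make the kind of rewrite concrete, here is a minimal Java sketch that sums the positive sub-query scores for one document, first with a stream and then with a plain loop. The method names and the choice of operation are assumptions made for illustration; the actual code paths changed in neural-search are described in the linked issue.

```java
import java.util.Arrays;

class StreamVersusLoopSketch {

    /** Stream version: concise, but each call sets up a stream pipeline and a lambda. */
    static double sumPositiveScoresWithStream(double[] subQueryScores) {
        return Arrays.stream(subQueryScores)
                .filter(score -> score > 0.0)
                .sum();
    }

    /** Loop version: the same logic with no intermediate objects. */
    static double sumPositiveScoresWithLoop(double[] subQueryScores) {
        double total = 0.0;
        for (double score : subQueryScores) {
            if (score > 0.0) {
                total += score;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] scores = {0.5, 0.0, 2.0, 1.25};
        System.out.println(sumPositiveScoresWithStream(scores));
        System.out.println(sumPositiveScoresWithLoop(scores));
    }
}
```

On a path that runs once per matching document, the per-call setup cost of the stream version can add up, which is the kind of overhead the paragraph above refers to.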

- **Eliminating unnecessary calculations:**
We also found that certain expensive calculations, such as computing hash codes for Query objects, were being performed unnecessarily. We removed these unnecessary calculations, so resources are now allocated more efficiently, speeding up hybrid queries.

The detailed analysis in the GitHub issue shows a 20% improvement in query processing speed:
- [Improve efficiency by eliminating unnecessary hash code computations #705](https://github.com/opensearch-project/neural-search/issues/705)
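
One way hash code computations can creep into a hot path is by keying per-document data on query objects. The sketch below is an assumption about the general pattern, not a description of the actual fix: it compares a map keyed by a hypothetical `SubQuery` class with an array indexed by sub-query position, and counts `hashCode()` calls to show the difference.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashCodeAvoidanceSketch {

    /** Counts hashCode() calls so the two approaches can be compared. */
    static int hashCodeCalls = 0;

    /** Hypothetical query type whose hash code we pretend is costly to compute. */
    static final class SubQuery {
        final String text;

        SubQuery(String text) {
            this.text = text;
        }

        @Override
        public int hashCode() {
            hashCodeCalls++;
            return text.hashCode();
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof SubQuery && ((SubQuery) other).text.equals(text);
        }
    }

    /** Keys scores by query object: every put and get hashes the query. */
    static float[] collectViaMap(List<SubQuery> subQueries, float[] scores) {
        Map<SubQuery, Float> scoreByQuery = new HashMap<>();
        for (int i = 0; i < subQueries.size(); i++) {
            scoreByQuery.put(subQueries.get(i), scores[i]);
        }
        float[] collected = new float[subQueries.size()];
        for (int i = 0; i < subQueries.size(); i++) {
            collected[i] = scoreByQuery.get(subQueries.get(i));
        }
        return collected;
    }

    /** Keys scores by sub-query position: no hashing is needed. */
    static float[] collectViaIndex(List<SubQuery> subQueries, float[] scores) {
        float[] collected = new float[subQueries.size()];
        for (int i = 0; i < subQueries.size(); i++) {
            collected[i] = scores[i];
        }
        return collected;
    }

    public static void main(String[] args) {
        List<SubQuery> subQueries = Arrays.asList(new SubQuery("match"), new SubQuery("neural"));
        float[] scores = {0.5f, 0.25f};
        collectViaMap(subQueries, scores);
        System.out.println("hashCode calls with map: " + hashCodeCalls);
        hashCodeCalls = 0;
        collectViaIndex(subQueries, scores);
        System.out.println("hashCode calls with index: " + hashCodeCalls);
    }
}
```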

- **Optimized data structures:**
We also found a more efficient way to use priority queues, which are used for some sorting operations. In particular, we improved hybrid query latency by changing the allocation strategy of query hits objects. Previously, those objects were pre-populated. We now perform lazy initialization, removing the lowest-score element when the queue reaches its full capacity.

Benchmarking shows that these optimizations improved query processing times by up to 10% for specific data processing tasks. For more details, see the GitHub issue:
- [In hybrid query optimize the way we iterate over results and collect scores of sub queries #745](https://github.com/opensearch-project/neural-search/issues/745)
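
The lazy strategy described above can be sketched with a bounded min-heap. This example uses `java.util.PriorityQueue` rather than the queue implementation used inside OpenSearch, and the `Hit` and `LazyTopHitsSketch` names are hypothetical. Hit objects are allocated only when a document actually enters the queue, and once the queue is full a new hit replaces the lowest-scoring one only if it scores higher.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class LazyTopHitsSketch {

    /** Hypothetical hit: a document ID with its score. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int capacity;
    private final PriorityQueue<Hit> queue; // min-heap: lowest score at the head
    int allocations = 0;                    // counts Hit objects created

    LazyTopHitsSketch(int capacity) {
        this.capacity = capacity;
        Comparator<Hit> byScore = Comparator.comparingDouble(hit -> hit.score);
        this.queue = new PriorityQueue<>(capacity, byScore);
    }

    /** Adds a hit lazily; nothing is allocated for documents that cannot make the top N. */
    void offer(int docId, float score) {
        if (queue.size() < capacity) {
            queue.add(newHit(docId, score));
        } else if (score > queue.peek().score) {
            queue.poll(); // drop the current lowest-scoring hit
            queue.add(newHit(docId, score));
        }
    }

    private Hit newHit(int docId, float score) {
        allocations++;
        return new Hit(docId, score);
    }

    int size() {
        return queue.size();
    }

    float lowestScore() {
        return queue.peek().score;
    }

    public static void main(String[] args) {
        LazyTopHitsSketch topHits = new LazyTopHitsSketch(2);
        topHits.offer(1, 0.5f);
        topHits.offer(2, 0.9f);
        topHits.offer(3, 0.7f);
        topHits.offer(4, 0.1f);
        System.out.println("kept " + topHits.size() + " hits, allocated " + topHits.allocations);
    }
}
```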

- **Reducing repetitive calculations:**
We have mitigated redundant internal computations by implementing value caching and reuse strategies. This has reduced the overall computational overhead within the system.

By optimizing the handling of repetitive calculations and promoting value reuse, we have sped up the system by 5%. For full details, see the GitHub issues:
- [Improve efficiency by caching and reusing internal calculations #756](https://github.com/opensearch-project/neural-search/issues/756)
- [Enhance performance through value caching and reuse strategies #764](https://github.com/opensearch-project/neural-search/issues/764)
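
Value caching of this kind usually amounts to computing a derived value once and reusing it on later calls. The sketch below is a generic Java illustration under that assumption; the class, the weight-sum example, and the `computations` counter are hypothetical and do not reflect which values neural-search actually caches.

```java
class CachedValueSketch {

    private final float[] subQueryWeights;
    private float cachedWeightSum;
    private boolean weightSumComputed = false;
    int computations = 0; // counts how often the sum is actually recomputed

    CachedValueSketch(float[] subQueryWeights) {
        this.subQueryWeights = subQueryWeights.clone();
    }

    /** Returns the sum of the weights, computing it on first use and reusing it afterward. */
    float weightSum() {
        if (!weightSumComputed) {
            float sum = 0f;
            for (float weight : subQueryWeights) {
                sum += weight;
            }
            cachedWeightSum = sum;
            weightSumComputed = true;
            computations++;
        }
        return cachedWeightSum;
    }

    public static void main(String[] args) {
        CachedValueSketch sketch = new CachedValueSketch(new float[] {0.5f, 0.25f, 0.25f});
        for (int doc = 0; doc < 1000; doc++) {
            sketch.weightSum(); // called per document, computed once
        }
        System.out.println("computations: " + sketch.computations);
    }
}
```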

Benchmark results have shown up to a 70% performance improvement for hybrid queries on large datasets (over 10M documents) in OpenSearch 2.15 compared to version 2.13. These benchmarks were conducted using a new OpenSearch Benchmark workload that the team has created specifically for evaluating semantic search use cases.

<table>
<tr>
<th rowspan=2>Number of documents retrieved</th>
<th rowspan=2>Number of hybrid sub-queries</th>
<th colspan=3>OpenSearch version 2.13</th>
<th colspan=3>OpenSearch version 2.15</th>
<th>Performance improvement</th>
</tr>
<tr>
<th>p50, ms</th>
<th>p90, ms</th>
<th>p99, ms</th>
<th>p50, ms</th>
<th>p90, ms</th>
<th>p99, ms</th>
<th>%</th>
</tr>
<tr>
<td class="left"><b>1.6K</b></td>
<td>1</td>
<td>75</td>
<td>77</td>
<td>78</td>
<td>75</td>
<td>76</td>
<td>76</td>
<td class="light-green-clr bold">1</td>
</tr>
<tr>
<td class="left"><b>1.6M</b></td>
<td>1</td>
<td>224</td>
<td>240</td>
<td>245</td>
<td>109</td>
<td>114</td>
<td>119</td>
<td class="light-green-clr bold">52</td>
</tr>
<tr>
<td class="left"><b>10M</b></td>
<td>1</td>
<td>729</td>
<td>841</td>
<td>868</td>
<td>237</td>
<td>257</td>
<td>264</td>
<td class="light-green-clr bold">70</td>
</tr>
<tr>
<td class="left"><b>15M</b></td>
<td>3</td>
<td>1224</td>
<td>1300</td>
<td>1367</td>
<td>294</td>
<td>330</td>
<td>343</td>
<td class="light-green-clr bold">75</td>
</tr>
<tr>
<td colspan=8 class="center bold">Average improvement in 2.15 compared to 2.13, %</td>
<td class="light-green-clr bold">49</td>
</tr>
</table>
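
The post does not state how the percentage column was derived. One plausible reading, assumed here, is the relative latency reduction averaged across the p50, p90, and p99 values of a row. The Java sketch below applies that formula to the 10M row; under this assumption the result is about 69%, close to but not exactly the published 70%, so the published figures may use a slightly different method or rounding.

```java
class LatencyReductionSketch {

    /** Relative latency reduction in percent: (before - after) / before * 100. */
    static double reductionPercent(double beforeMs, double afterMs) {
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }

    /** Averages the per-percentile reductions for one table row. */
    static double averageReductionPercent(double[] beforeMs, double[] afterMs) {
        double total = 0.0;
        for (int i = 0; i < beforeMs.length; i++) {
            total += reductionPercent(beforeMs[i], afterMs[i]);
        }
        return total / beforeMs.length;
    }

    public static void main(String[] args) {
        // p50, p90, p99 latencies in ms for the 10M row of the table
        double[] version213 = {729, 841, 868};
        double[] version215 = {237, 257, 264};
        System.out.printf("average reduction: %.1f%%%n", averageReductionPercent(version213, version215));
    }
}
```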

---

## Future Improvements

We continue to analyze the performance of OpenSearch and identify opportunities for further enhancements. We are considering the following improvements for future versions:

- **Advanced optimization techniques for hybrid query:**
We may iterate over blocks of documents rather than individual ones to further reduce latency and enhance performance. These techniques aim to streamline the processing of hybrid queries by minimizing the computational overhead associated with handling large volumes of data.

- **Algorithmic refinements:**
We may refine existing algorithms and introduce new ones that are better suited to hybrid search. This includes optimizing the ranking and scoring mechanisms to produce more accurate results faster.

Additionally, the team is considering the following ongoing initiatives to provide continuous performance insights and improvements:

- **Nightly benchmark runs:**
Starting with version 2.15, the team will publish the results of nightly hybrid query benchmark runs so users can track changes in performance between versions. These results will be available on the [OpenSearch Benchmarks page](https://opensearch.org/benchmarks/).

- **Enhanced benchmark workloads:**
We may add more extensions to the benchmark workloads to gather metrics for vector search queries in addition to text search queries.

These enhancements aim to broaden the capabilities and improve the efficiency of OpenSearch even further.

---

## References

1. [[META] Improve Hybrid query latency](https://github.com/opensearch-project/neural-search/issues/704)
2. [OpenSearch Benchmark workload for semantic search](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/noaa_semantic_search)
3. [OpenSearch Benchmarks page](https://opensearch.org/benchmarks/)
4. [Improve search relevance with hybrid search, generally available in OpenSearch 2.10](https://opensearch.org/blog/hybrid-search/)
5. [Hybrid query](https://opensearch.org/docs/latest/query-dsl/compound/hybrid/)
Binary file added assets/media/community/members/varunudr.jpg