[FEATURE INTAKE] Improving Search relevancy through Reranker interfaces #542

Closed
4 of 5 tasks
martin-gaievski opened this issue Jan 17, 2024 · 9 comments
Labels: Features (Introduces a new unit of functionality that satisfies a requirement)

Comments

@martin-gaievski
Member

martin-gaievski commented Jan 17, 2024

This document captures the activities that need to be performed in order to prepare the Re-ranking Feature #485 for release.

Release Activities

Below are the release activities that need to be completed to ensure that the Re-ranking feature can be merged into the 2.12 release of OpenSearch.
Code Freeze Date: Feb 6, 2024
Release calendar: https://opensearch.org/releases.html

  • PR Merge
  • Application Security Approval
  • Benchmarking
  • Documentation
  • Feature Demo

PR Merge

Once the PR is approved, it can be merged into the feature branch. The change will move to the main and 2.x branches once the security review and benchmarking are done. We can wait for the documentation to be completed.

Status: Completed

Application Security Approval

Status: In progress

Benchmarking

To ensure that this feature is fully tested and that we are aware of its latency and search relevancy impact, the team needs to run benchmarks.

Status: Not started

Benchmarking Details

Benchmarking tool: https://github.com/martin-gaievski/info-retrieval-test/tree/score-normalization-combination-testing/beir/retrieval

Cluster Configuration

Config Key | Value
Data nodes | 3
Data node type | r5.8xlarge
Master nodes | 1
Master node type | c4.2xlarge
ML nodes | Use data nodes as ML nodes
ML node type | N/A
ML model link | https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b
Heap size | 32 GB
Number of shards | 12
Number of replicas | 1
Number of segments | No force merge is required
Refresh interval | default
Bulk size | 200
Bulk clients | 1
Search clients | 1
k | 100
k-NN algorithm | hnsw
size | 100
k-NN engine | nmslib
Space type | inner product
Dimensions | 768
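
For orientation, the cluster configuration above corresponds roughly to an index definition like the one below. This is a minimal sketch assembled from the table, not the exact setup used by the benchmarking tool; the index name (beir-test) and the field names (passage_text, passage_embedding) are illustrative.

PUT /beir-test
{
  "settings": {
    "index": {
      "number_of_shards": 12,
      "number_of_replicas": 1,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "innerproduct"
        }
      }
    }
  }
}
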
Data sets

Data set Name | Link to download data set | Model zip file name | Model file link
NFCorpus | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | nfcorpus_traced.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main
Trec-Covid | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | trec_covid_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main
Scidocs | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | scidocs_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main
Quora | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | quora_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main
Amazon ESCI | https://github.com/amazon-science/esci-data?tab=readme-ov-file#usage | amazon_traced.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main
DBPedia | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | dbpedia_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main
FiQA | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | fiqa_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main

Re-ranking Model

Use models that are open source so that the results can be reproduced by other users; one example is sketched below.

Model Name | Link to download model

Once you have the results, please paste the result table on the RFC. The benchmarking results will be reviewed by the maintainers of the Neural Search plugin to understand the trade-offs. As general guidance, you should be able to justify the latency trade-offs (if any) with improved relevancy metrics.
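
For the re-ranking model table above, one option is a pretrained cross-encoder from the OpenSearch model repository, registered and deployed through ML Commons roughly as follows. This is only a sketch: the model name and version shown here are an example and should be checked against the current pretrained model list, and <model_id> is a placeholder for the model ID produced by the register task.

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/cross-encoders/ms-marco-MiniLM-L-6-v2",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}

POST /_plugins/_ml/models/<model_id>/_deploy
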

Documentation

As this is a new feature, the feature owner needs to start writing the documentation for it. This task has not been started yet. We can add the new section under the Search Relevance section of the documentation: https://opensearch.org/docs/latest/search-plugins/search-relevance/index/.

Status: Not started

Expectations for the documentation:

  1. A working example should be provided to outline how to use the feature.
  2. Examples need to be provided on how to upload a local re-ranking model.
  3. An example ML Commons blueprint needs to be added and linked in this documentation to show how to use a remote re-ranking model such as Cohere.
  4. Examples and details need to be provided on how the query and the processor can be configured; all the different permutations and combinations need to be covered (see the sketch after this list).
  5. Add a limitations section, if applicable.
  6. If a blog post on this is not planned, add the benchmarking details to the documentation.
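
As a starting point for item 4, the processor side can be sketched as a search pipeline with a rerank response processor that points at a re-ranking model. The pipeline name and document field below are illustrative and the model ID is a placeholder, so the final documented examples may differ.

PUT /_search/pipeline/rerank_pipeline
{
  "response_processors": [
    {
      "rerank": {
        "ml_opensearch": {
          "model_id": "<re-ranking model ID>"
        },
        "context": {
          "document_fields": ["text_representation"]
        }
      }
    }
  ]
}

A query then runs with search_pipeline=rerank_pipeline and an ext.rerank block that tells the processor which query text to rerank against, as shown in the examples further down this thread.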

Feature Demo

As this is a new feature, we need a feature demo for it. The Aryn team can provide a demo video.

Status: Not started

@martin-gaievski added the Features label on Jan 17, 2024
@HenryL27
Contributor

@martin-gaievski is there any chance you have a CloudFormation script or something similar that sets up the benchmarking cluster, or am I on my own?

@martin-gaievski
Member Author

If you are OK with hosting the cluster in AWS, you can use this tool: https://github.com/opensearch-project/opensearch-cluster-cdk. Since your code isn't part of the official build, you'll need to replace the neural-search artifacts on the data nodes before running benchmarks. You can try to build a complete deployable tarball using https://github.com/opensearch-project/opensearch-build/, but that setup may be a bit complex, so for a one-time test like this I suggest using cluster-cdk. You can take this latest 2.12 distribution build as a basis:

https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/x64/tar/dist/opensearch/opensearch-2.12.0-linux-x64.tar.gz
https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/arm64/tar/dist/opensearch/opensearch-2.12.0-linux-arm64.tar.gz
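
One way to double-check that the data nodes are actually running the replaced neural-search artifact (a suggestion, not part of the original instructions) is to list the installed plugins and their versions:

GET /_cat/plugins?v
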

@HenryL27
Contributor

HenryL27 commented Feb 5, 2024

Performance benchmarks:

First a couple notes:

  • Cluster configuration: 3 r5.8xlarge, 1 c4.2xlarge
  • I did not load dbpedia as I ran out of time (or overloaded the cluster with too many requests. both happened)
  • I did not test trec-covid as I ingested it with a different tool that didn't play well with the benchmarking script (mappings were misaligned)
  • I didn't even try Amazon ESCI
  • The benchmarking script appears to use the "took" field of search responses to measure time. This makes sense. Unfortunately, it appears that this field is not updated by search pipelines, so it's basically useless for the purposes of this experiment. Instead, I measured total time from request to response directly in the benchmarking script. So all measurements include the network call timing (ssh from my laptop to AWS (IAD) cluster). I wouldn't be surprised if that contributes in large part to some of the baseline measurements.
  • I only ran 300 queries from each dataset. This was essential for something like quora, which has 10,000 queries, and since reranking is rather resource intensive, that adds up quickly (I needed to finish this month)
  • I did not run these on a GPU instance. I was going for fidelity to the cluster specification listed above, but given my other issues I'm not sure that was worth it. But that would probably speed up the model inferences significantly.
  • All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches
  • I used size=50 for all tests. Reranking latency scales linearly with size (except maybe if you used a remote API like cohere. But these are locally hosted model tests)
Ok, table:

dataset  | model                       | p50 (ms) | p90 (ms) | p99 (ms)  | nDCG@10
fiqa     | bm25                        | 156.0    | 182.0    | 268.03    | 0.2175
fiqa     | neural, no reranking        | 506.0    | 974.4    | 1,335.25  | 0.3859
fiqa     | MiniLM-L-6-v2               | 2,177.5  | 2,372.2  | 2,558.25  | 0.3620
fiqa     | bge-rerank-base             | 12,542.5 | 13,436.7 | 14,142.18 |
fiqa     | bge-rerank-base (quantized) | 2,780.0  | 3,776.7  | 4,604.32  | 0.3217
nfcorpus | bm25                        | 155.0    | 179.1    | 229.23    | 0.3018
nfcorpus | neural, no reranking        | 446.0    | 948.2    | 1,326.14  | 0.3140
nfcorpus | MiniLM-L-6-v2               | 5,500.0  | 6,203.6  | 6,692.94  | 0.3352
nfcorpus | bge-rerank-base             | 13,018.0 | 13,742.8 | 14,221.22 |
nfcorpus | bge-rerank-base (quantized) | 4,438.5  | 5,391.5  | 6,115.67  | 0.2987
quora    | bm25                        | 157.0    | 182.0    | 262.13    | 0.7230
quora    | neural, no reranking        | 506.0    | 952.7    | 1,307.02  | 0.8920
quora    | MiniLM-L-6-v2               | 993.0    | 1,207.2  | 1,541.17  | 0.8475
quora    | bge-rerank-base             | 5,497.5  | 6,361.3  | 7,446.49  | 0.6711
quora    | bge-rerank-base (quantized) | 644.5    | 1,024.3  | 1,334.68  | 0.7074
scidocs  | bm25                        | 156.0    | 182.0    | 249.17    | 0.1461
scidocs  | neural, no reranking        | 509.0    | 961.4    | 1,302.03  | 0.2180
scidocs  | MiniLM-L-6-v2               | 2,150.0  | 2,348.7  | 2,705.5   | 0.1696
scidocs  | bge-rerank-base             | 12,998.5 | 13,967.9 | 15,089.01 |
scidocs  | bge-rerank-base (quantized) | 3,171.5  | 3,955.2  | 4,911.07  | 0.1477

If these seem like rather lackluster results, that's because they kinda are... I think I may have prepared bge wrong?

@martin-gaievski
Member Author

@HenryL27 thank you for sharing these results. For the review we also need results for search relevancy, mainly nDCG@10; the format can be a simplified version of the one shared in the blog post for hybrid query search.

@navneet1v
Collaborator

@HenryL27 I am closing this issue, as the feature has been released.

@amitgalitz
Member

Hey @HenryL27, I am currently trying to replicate these results for reranking. I wanted to ask what you meant by this statement:

All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches

For neural search benchmarking I am following the example given by Martin to test search latency with a neural query. However, for reranking, based on the docs I see for the feature, the query isn't a neural query. What do you mean by "all reranking experiments were done over neural searches"?

@HenryL27
Contributor

reranking is an extension of a query - you run a query and then you rerank the results from that query.

All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches

means that the base query I used (and subsequently extended with reranking) was a neural query with minilm.

@amitgalitz
Member

reranking is an extension of a query - you run a query and then you rerank the results from that query.

All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches

means that the base query I used (and subsequently extended with reranking) was a neural query with minilm.

I see, so the index you queried against was a k-NN index (or at least had embeddings stored in one of the fields)? On the docs website I saw:

POST /_search?search_pipeline=rerank_pipeline
{
  "query": {
    "match": {
      "text_representation": {
        "query": "Where is Albuquerque?"
      }
    }
  },
  "ext": {
    "rerank": {
      "query_context": {
        "query_text_path": "query.match.text_representation.query"
      }
    }
  }
}

Is it more relevant to benchmark a query that looks like this instead:

{
    "query": {
        "neural": {
          "passage_embedding": {
            "query_text": "Hi world",
            "k": 100
          }
        }
      },
  "ext": {
    "rerank": {
      "query_context": {
        "query_text_path": "query.neural.passage_embedding.query_text"
      }
    }
  }
}

My goal is to get a good baseline for p50, p90, and p99 query latency for a reranked neural clause, for comparison.

@HenryL27
Contributor

yeah, the second is essentially what I did in benchmarking. In general the reranking latency is gonna be substantially higher than the neural or bm25, just by nature of the computations that are going on.
