Remove unnecessary override code #2695

daixque · 2024-11-11T13:41:49Z

I tested below code to test DenseVectorStrategy:

strategy = DenseVectorStrategy(
    model_id=".multilingual-e5-small_linux-x86_64",
    hybrid=True,
    text_field="title",   
)
store = ElasticsearchStore(
    es_connection=elasticsearch_client,
    index_name=INDEX,
    vector_query_field="content_semantic.inference.chunks.embeddings",
    query_field="content",
    strategy=strategy,
)

But I realized text_field="title" parameter in the DenseVectorStrategy constructor doesn't affect since it is overridded in:

https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/helpers/vectorstore/_sync/vectorstore.py#L83-L86

And also there is no vector_field in all of strategy classes.

This PR intends to:

Override text_field field only when it is not provided as part of constructor
Remove unnecessary override code since if hasattr(retrieval_strategy, "vector_field") is always False

github-actions · 2024-11-11T13:42:03Z

A documentation preview will be available soon.

🔨 Buildkite builds
📚 HTML diff
📙 Preview page

Request a new doc build by commenting

Rebuild this PR: run docs-build
Rebuild this PR and all Elastic docs: run docs-build rebuild

_{run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.}

_{If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.}

miguelgrinberg · 2024-11-11T15:55:41Z

Wouldn't your problem be addressed if you move the text field to the vector store?

strategy = DenseVectorStrategy(
    model_id=".multilingual-e5-small_linux-x86_64",
    hybrid=True,
)
store = ElasticsearchStore(
    es_connection=elasticsearch_client,
    index_name=INDEX,
    vector_query_field="content_semantic.inference.chunks.embeddings",
    query_field="title",   
    strategy=strategy,
)

And also, note that query_field does not exist in our vector store class, it only appears in the Langchain integration, and is in fact converted to text_field when is passed up to the base class. Basically in this example you were running a hybrid search in which the BM25 and KNN sides were using different fields.

So I think you need to first of all decide if you want to use title or content for your search, and then pass the chosen field in the store class as query_field.

Unfortunately this way these constructors were designed is extremely convoluted, which is really the source of the problem you are having. Your fix addresses the way in which you are using this integration but does not do much to improve the overall design, in fact you may not have realized that you are searching two different fields.

daixque · 2024-11-12T00:51:35Z

Hi @miguelgrinberg , thanks for your comment. It is understood. Currently I configured with the same way with your suggestion. Below is just my comments.

If I set query_field="title" in ElasticsearchStore, it will pass to the LLM as part of the prompt and it is not my preference. I'd like to keep the content for the prompt.

My first intention is DenseVectorStrategy provides text_field parameter but its value which set in the constructor never used. I was confused by this behavior, so I'd like to avoid it if possible. If DenseVectorStrategy is only for ElasticsearchStore and it always override text_field, then I feel the constructor parameter doesn't make sense.

Second, if we can set the text_field for DenseVectorStrategy, then we can tune the query as I show in first example, title for BM25, embedding field which comes from content for knn, and full content for query for the LLM prompt. I found this configuration works well. But maybe for further tuning, we should create our own Strategy sub-class.

FYI, here is my (partial) mapping

{
  "properties": {
    "content": {
      "type": "text",
      "copy_to": [
        "content_semantic"
      ],
      "analyzer": "kuromoji"
    },
    "content_semantic": {
      "type": "semantic_text",
      "inference_id": "my-e5-endpoint",
      "model_settings": {
        "task_type": "text_embedding",
        "dimensions": 384,
        "similarity": "cosine",
        "element_type": "float"
      }
    },
    "title": {
      "type": "text",
      "analyzer": "kuromoji"
    }
  }
}

miguelgrinberg · 2024-11-12T11:21:52Z

Right, so you are taking advantage of a design flaw in these classes, and running a hybrid search that uses different fields for the BM25 and kNN portions, which as far as I can see is not an intended usage. You are suggesting that text_field is being ignored, but from my understanding of the design of this integration that is actually the correct behavior, because you are passing query_field in the vectorstore class which takes precendence, so it makes sense for this field to be used in your searches.

If you wanted to keep doing this, then you do not need to change anything, just do this:

strategy = DenseVectorStrategy(
    model_id=".multilingual-e5-small_linux-x86_64",
    hybrid=True,
)
store = ElasticsearchStore(
    es_connection=elasticsearch_client,
    index_name=INDEX,
    vector_query_field="content_semantic.inference.chunks.embeddings",
    query_field="content",   
    strategy=strategy,
)
strategy.text_field = 'title'  # override the field used in the kNN search

daixque · 2024-11-12T12:15:41Z

@miguelgrinberg
Ah, indeed it will work but bit tricky 😇
Anyway thank you so much for your feedback! Now I'm clear.

Remove unnecessary override code

9eeeab7

Apply lint

fd9c24a

daixque closed this Nov 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove unnecessary override code #2695

Remove unnecessary override code #2695

daixque commented Nov 11, 2024

github-actions bot commented Nov 11, 2024

miguelgrinberg commented Nov 11, 2024 •

edited

Loading

daixque commented Nov 12, 2024

miguelgrinberg commented Nov 12, 2024 •

edited

Loading

daixque commented Nov 12, 2024

Remove unnecessary override code #2695

Remove unnecessary override code #2695

Conversation

daixque commented Nov 11, 2024

github-actions bot commented Nov 11, 2024

miguelgrinberg commented Nov 11, 2024 • edited Loading

daixque commented Nov 12, 2024

miguelgrinberg commented Nov 12, 2024 • edited Loading

daixque commented Nov 12, 2024

miguelgrinberg commented Nov 11, 2024 •

edited

Loading

miguelgrinberg commented Nov 12, 2024 •

edited

Loading