VectorStore for GenAI integrations #2528

maxjakob · 2024-04-17T09:09:36Z

We want to add a higher-level abstraction for using Elasticsearch in Python. The main motivation is to make integrations into 3rd party GenAI libraries easier and give users a handful of options to choose from without diving into the Query DSL. Ideally the abstraction also works for direct users of the elasticsearch-py client.

Integrations

LangChain: current and draft branch utilizing this PR
LlamaIndex: current and draft branch utilizing this PR
Haystack: current

TODO

github-actions · 2024-04-17T09:09:50Z

A documentation preview will be available soon.

🔨 Buildkite builds
📚 HTML diff
📙 Preview page

Request a new doc build by commenting

Rebuild this PR: run docs-build
Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

pquentin

Thanks for this draft! I've reviewed with an eye towards the big things that could be improved to make this feel native in the Python client. I've not looked at the detail of the code, since I don't expect to bring value here.

The big question is going to be the name of this module and its location in the client.

I don't think store works here: we have to communicate that this is an LLM/GenAI thing. I would thus call it vectorstore or vector_store. (Indeed, according to PEP 8, underscores can be used in the module name if it improves readability.)
Currently, all the manually written code is in elasticsearch.helpers which felt like the natural location for this new code initially. But I'm less sure now, given existing helpers are simply functions and this is a lot more code with a lot more state. So I'm tempted to think that elasticsearch.vector_store is a better location than elasticsearch.helpers.vector_store.

elasticsearch/store/_utilities.py

elasticsearch/store/embedding_service.py

test_elasticsearch/test_store_integration/docker-compose.yml

elasticsearch/store/embedding_service.py

elasticsearch/store/store.py

elasticsearch/store/_utilities.py

maxjakob · 2024-04-17T14:54:33Z

> I don't think store works here: we have to communicate that this is an LLM/GenAI thing. I would thus call it vectorstore or vector_store. (Indeed, according to PEP 8, underscores can be used in the module name if it improves readability.)

It was called vectorstore for the longest time :) I renamed it because of the BM25 capabilities. But you're right, it's more appealing/familiar to people with vector in the name.

maxjakob · 2024-04-17T14:56:16Z

> Currently, all the manually written code is in elasticsearch.helpers which felt like the natural location for this new code initially. But I'm less sure now, given existing helpers are simply functions and this is a lot more code with a lot more state. So I'm tempted to think that elasticsearch.vector_store is a better location than elasticsearch.helpers.vector_store.

I share this sentiment (but don't have a strong opinion).

ezimuel · 2024-04-18T07:42:29Z

> I don't think store works here: we have to communicate that this is an LLM/GenAI thing. I would thus call it vectorstore or vector_store. (Indeed, according to PEP 8, underscores can be used in the module name if it improves readability.)

I think vectorstore (or vector_store) is a better name, since it emphasizes the semantic search feature.

> Currently, all the manually written code is in elasticsearch.helpers which felt like the natural location for this new code initially. But I'm less sure now, given existing helpers are simply functions and this is a lot more code with a lot more state. So I'm tempted to think that elasticsearch.vector_store is a better location than elasticsearch.helpers.vector_store.

Since this code consumes the client as an additional layer, I think putting it into the elasticsearch.helpers namespace is the perfect fit. The helpers should collect all the high-level features that facilitate the usage of the Elasticsearch endpoints.

pquentin

Reviewed two small things, but please let me know when you need a more complete review.

test_elasticsearch/test_server/test_vectorstore/test_vectorestore copy.py

examples/bulk-ingest/bulk-ingest.py

maxjakob · 2024-04-22T15:20:53Z

@pquentin Can I get your eyes on this again? 👀 Some questions from my side:

Do we agree on the general interfaces?
Can you give me some pointers on how to best set up vectorstore as an optional dependency? EDIT: ✅
Do we need to do anything special for the -serverless package? EDIT: ✅

maxjakob · 2024-04-25T09:16:09Z

> Adding the new classes to the documentation (this may require reformatting the docstrings to follow the sphinx format, as I noted in the code)

👍 reformatting...

> Async fixtures and tests

I will do this in a follow-up PR because, to do it right, I would like to create two directories, _async and _sync, like in the application code. That would involve moving existing tests, and I would like to keep that out of this PR.
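The `_async`/`_sync` split referred to here comes from generating the sync code out of the async source (see the "generate _sync files" commit). The block below is only a toy sketch of that idea. It is not the project's `utils/run-unasync.py` script and not the real `unasync` library. `to_sync`, `REPLACEMENTS`, and `count_docs` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical rename table (an assumption for this sketch); a real
# generator would keep its own, longer list of async -> sync names.
REPLACEMENTS = {
    "AsyncElasticsearch": "Elasticsearch",
    "AsyncVectorStore": "VectorStore",
    "_async": "_sync",
}


def to_sync(source: str) -> str:
    """Derive sync source text from async source text (toy version)."""
    # Drop the `async` keyword in front of def/with/for.
    source = re.sub(r"\basync\s+(def|with|for)\b", r"\1", source)
    # Drop `await` in front of expressions.
    source = re.sub(r"\bawait\s+", "", source)
    # Rename identifiers, longest names first, matching whole words only.
    for old in sorted(REPLACEMENTS, key=len, reverse=True):
        source = re.sub(rf"\b{re.escape(old)}\b", REPLACEMENTS[old], source)
    return source


ASYNC_SOURCE = '''
from elasticsearch import AsyncElasticsearch

async def count_docs(client: AsyncElasticsearch, index: str) -> int:
    response = await client.count(index=index)
    return response["count"]
'''

sync_source = to_sync(ASYNC_SOURCE)
print(sync_source)
```

Applying the same transformation to a test directory would let the `_sync` tests be derived from the `_async` ones instead of being maintained by hand, which is the motivation for mirroring the application layout.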

pquentin

Thanks! We looked at this together and found a few really minor formatting things to change. I'll test that quickly and then we can merge! 🎉

test_elasticsearch/test_server/test_helpers_vectorstore/_test_utils.py

test_elasticsearch/test_server/conftest.py

elasticsearch/helpers/vectorstore/_async/strategies.py

setup.py

elasticsearch/helpers/vectorstore/__init__.py

pquentin · 2024-04-29T09:42:48Z

elasticsearch/helpers/vectorstore/_async/strategies.py

+            raise ValueError("specify a query_vector")
+
+        if self.distance is DistanceMetric.COSINE:
+            similarityAlgo = (


nit: We could consider unit testing all those combinations with ReferenceJson and help from GitHub Copilot to generate the tests. But then up to you as this is also quite simple and the tests are going to look a lot like the original code anyway.

Maybe focus on the raised exceptions, using code like `with pytest.raises(ValueError, match="specify a query_vector"): ...`.

maxjakob · 2024-04-29

Testing the error states now. Assertions for the ES queries are part of the integration tests.
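A minimal sketch of what such an error-state test could look like, under stated assumptions: `DistanceMetric` and `build_script_score_query` below are simplified, hypothetical stand-ins written for this example, not the strategy classes from the PR. Only the `ValueError("specify a query_vector")` path mirrors the diff. The sketch uses the standard library's `unittest` so that it runs without extra dependencies; with pytest the same check would be `with pytest.raises(ValueError, match="specify a query_vector"):`.

```python
import unittest
from enum import Enum
from typing import Any, Dict, List, Optional


class DistanceMetric(str, Enum):
    """Hypothetical stand-in for the enum referenced in the diff."""

    COSINE = "COSINE"
    DOT_PRODUCT = "DOT_PRODUCT"


def build_script_score_query(
    query_vector: Optional[List[float]],
    distance: DistanceMetric = DistanceMetric.COSINE,
    vector_field: str = "vector_field",
) -> Dict[str, Any]:
    """Toy query builder; only the error path mirrors the reviewed code."""
    if query_vector is None:
        raise ValueError("specify a query_vector")

    if distance is DistanceMetric.COSINE:
        # Painless vector function; +1.0 keeps the score non-negative.
        source = f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0"
    else:
        source = f"dotProduct(params.query_vector, '{vector_field}')"

    return {
        "script_score": {
            "query": {"match_all": {}},
            "script": {"source": source, "params": {"query_vector": query_vector}},
        }
    }


class StrategyErrorTests(unittest.TestCase):
    def test_missing_query_vector_raises(self) -> None:
        # The regex is matched against the exception message.
        with self.assertRaisesRegex(ValueError, "specify a query_vector"):
            build_script_score_query(None)

    def test_valid_vector_builds_query(self) -> None:
        body = build_script_score_query([0.1, 0.2])
        self.assertIn("cosineSimilarity", body["script_score"]["script"]["source"])
```

Testing only the error states at the unit level keeps these tests from restating the query-building code line by line, while the shape of the generated queries stays covered by the integration tests.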

test_elasticsearch/test_server/test_helpers_vectorstore/test_vectorstore.py

utils/run-unasync.py

elasticsearch/helpers/vectorstore/_async/vectorstore.py

test_elasticsearch/test_strategies.py

pquentin

Thanks! LGTM.

* ElasticsearchStore
* Update elasticsearch/store/_utilities.py
  Co-authored-by: Quentin Pradet <[email protected]>
* rename; depend on client; async only
* generate _sync files
* add cleanup step for _sync generation
* fix formatting
* more linting fixes
* batch embedding call; infer num_dimensions
* revert accidental changes
* keep field names only in store; apply metadata mappings in store
* fix typos in file names
* use `elasticsearch_url` fixture; create conftest.py
* export relevant classes
* remove Semantic strategy
  wait for `semantic_text` to land
* es_query is sync
* async strategies
* cleanup old file
* add docker-compose service with model deployment
* optional dependencies for MMR
* only test sync parts
* cleanup unasync script
* nox: install optional deps
* fix tests with requests remembering Transport
* fix numpy typing
* add user agent default argument
* move to `elasticsearch.helpers.vectorstore`
* use Protocol over ABC
* revert Protocol change because Python 3.7
* address PR feedback:
  - Strategy suffix
  - Sphinx docstrings
  - add user agent to EmbeddingService
  - raise ConflictError
  - various cleanup
* improve docstring
* fix metadata mappings issue
* address PR feedback
* add error tests for strategies
* canonical names, keyword args only
* fix sparse vector strategy bug (duplicate `size`)
* all wildcard deletes in compose ES

---------

Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit c2b0ca3)

ElasticsearchStore

f30f1ad

maxjakob added the Category: Enhancement label Apr 17, 2024

pquentin reviewed Apr 17, 2024

Update elasticsearch/store/_utilities.py

e03a17f

Co-authored-by: Quentin Pradet <[email protected]>

maxjakob changed the title ~~ElasticsearchStore~~ VectorStore for GenAI integrations Apr 18, 2024

maxjakob added 6 commits April 18, 2024 15:13

rename; depend on client; async only

8ff1c7c

generate _sync files

9be44fd

add cleanup step for _sync generation

7ee3846

fix formatting

2fd89bd

more linting fixes

9387b74

batch embedding call; infer num_dimensions

b18d63d

pquentin reviewed Apr 22, 2024

test_elasticsearch/test_server/test_vectorstore/test_vectorestore copy.py (outdated)

examples/bulk-ingest/bulk-ingest.py (outdated)

maxjakob added 3 commits April 22, 2024 12:04

revert accidental changes

9f83408

keep field names only in store; apply metadata mappings in store

9803414

fix typos in file names

7647961

maxjakob force-pushed the genai-orchestration branch 2 times, most recently from d4a84c1 to 543b49f Compare April 22, 2024 14:32

use elasticsearch_url fixture; create conftest.py

d397982

maxjakob force-pushed the genai-orchestration branch from 543b49f to d397982 Compare April 22, 2024 15:02

maxjakob requested a review from pquentin April 22, 2024 15:20

maxjakob added 5 commits April 23, 2024 15:34

export relevant classes

2f1fcb0

remove Semantic strategy

b19de27

wait for `semantic_text` to land

es_query is sync

274911a

async strategies

8cec9cc

cleanup old file

bbf2be9

maxjakob added 3 commits April 25, 2024 13:09

address PR feedback:

71ca330

- Strategy suffix - Sphinx docstrings - add user agent to EmbeddingService - raise ConflictError - various cleanup

improve docstring

a5dea84

fix metadata mappings issue

6f81af9

pquentin reviewed Apr 29, 2024

maxjakob added 2 commits April 29, 2024 13:35

address PR feedback

881d56c

add error tests for strategies

f32ceb2

maxjakob force-pushed the genai-orchestration branch from 4f28761 to f32ceb2 Compare April 29, 2024 12:18

maxjakob requested review from ezimuel and pquentin and removed request for ezimuel April 29, 2024 12:19

pquentin reviewed Apr 30, 2024

elasticsearch/helpers/vectorstore/_async/vectorstore.py (three review threads, outdated)

maxjakob added 3 commits April 30, 2024 14:49

canonical names, keyword args only

9b1778e

fix sparse vector strategy bug (duplicate size)

a8d80f2

all wildcard deletes in compose ES

d27f9f8

maxjakob force-pushed the genai-orchestration branch from e00d182 to d27f9f8 Compare April 30, 2024 12:49

pquentin reviewed Apr 30, 2024

test_elasticsearch/test_strategies.py (resolved)

pquentin approved these changes Apr 30, 2024

maxjakob merged commit c2b0ca3 into main Apr 30, 2024
18 checks passed

maxjakob deleted the genai-orchestration branch April 30, 2024 13:19

pquentin added backport 8.13 backport 8.14 labels May 2, 2024

github-actions bot mentioned this pull request May 2, 2024

[Backport 8.13] VectorStore for GenAI integrations #2540

Merged

github-actions bot mentioned this pull request May 2, 2024

[Backport 8.14] VectorStore for GenAI integrations #2541

Merged

This was referenced May 3, 2024

Use orchestration lib langchain-ai/langchain-elastic#22

Merged

Integrate VectorStore from Elasticsearch client run-llama/llama_index#13291

Merged
