VectorStore for GenAI integrations #2528

Merged
merged 36 commits into from
Apr 30, 2024
Conversation

maxjakob
Contributor

@maxjakob maxjakob commented Apr 17, 2024

We want to add a higher-level abstraction for using Elasticsearch in Python. The main motivation is to make integrations into 3rd party GenAI libraries easier and give users a handful of options to choose from without diving into the Query DSL. Ideally the abstraction also works for direct users of the elasticsearch-py client.
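As a rough illustration of what "a handful of options without diving into the Query DSL" could look like, here is a minimal sketch. The names `Bm25Sketch`, `KnnSketch` and `build_search_body`, and the field names `text_field` and `vector_field`, are hypothetical and are not the interface added by this PR. Only the general shape of the Elasticsearch `match` query and the top-level `knn` search option is taken from the Query DSL.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Bm25Sketch:
    """Hypothetical option: plain lexical (BM25) retrieval."""

    text_field: str = "text_field"

    def es_query(
        self, query: str, query_vector: Optional[List[float]], k: int
    ) -> Dict[str, Any]:
        # A standard full-text `match` query; the vector is ignored.
        return {"query": {"match": {self.text_field: {"query": query}}}, "size": k}


@dataclass
class KnnSketch:
    """Hypothetical option: approximate kNN retrieval on a dense vector field."""

    vector_field: str = "vector_field"
    num_candidates: int = 50

    def es_query(
        self, query: str, query_vector: Optional[List[float]], k: int
    ) -> Dict[str, Any]:
        if query_vector is None:
            raise ValueError("specify a query_vector")
        return {
            "knn": {
                "field": self.vector_field,
                "query_vector": query_vector,
                "k": k,
                # num_candidates may not be smaller than k.
                "num_candidates": max(self.num_candidates, k),
            }
        }


def build_search_body(
    strategy: Any, query: str, k: int = 4, query_vector: Optional[List[float]] = None
) -> Dict[str, Any]:
    """The caller picks an option; the option writes the search body."""
    return strategy.es_query(query=query, query_vector=query_vector, k=k)
```

With this shape, switching from lexical to vector retrieval means passing a different object rather than rewriting the search body by hand.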

Integrations

TODO

  • Agree on shape of the interfaces <-- nobody complained ;)
  • Move client creation code to 3rd parties, receive client object
  • Separate classes for sync and async
  • Move embedding service calls to the store, batch the inference
  • Move module to helpers.vectorstore
  • Deduplicate field names between store and strategies (as in the original implementation)
  • Deduplicate application of metadata mapping
  • Semantic strategy: test it once there is a Docker image with it or remove it
  • Make integration tests work in CI
  • Consider limiting values for k and num_candidates for (async) MMR --> CPU-intensive MMR function is now sync only
  • Create extra requirements for MMR-related dependencies
  • Adhere to repo-specific code styles
  • Inquire what needs to be done for the -serverless package --> nothing here, @pquentin will take care of this.
  • Add default arg for user agent
  • Documentation? <-- We can worry about it later when the integrations are stable.
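For readers unfamiliar with why the MMR item above is called CPU-intensive, here is a minimal pure-Python sketch of maximal marginal relevance (MMR) re-ranking. It is not the code from this PR: the function names and the `lambda_mult` parameter are assumptions chosen for illustration, and the standard library is used instead of any optional numeric dependency. Each selection step compares every remaining candidate with every already selected one, so the cost grows quickly with both k and the number of candidates.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_sketch(
    query_vector: Sequence[float],
    candidate_vectors: Sequence[Sequence[float]],
    k: int = 4,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates, trading relevance for diversity.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalise similarity to already selected candidates.
    """
    selected: List[int] = []
    remaining = list(range(len(candidate_vectors)))
    while remaining and len(selected) < k:
        best_index = remaining[0]
        best_score = -math.inf
        for i in remaining:
            relevance = cosine_similarity(query_vector, candidate_vectors[i])
            # Highest similarity to anything picked so far (0.0 on the first pick).
            redundancy = max(
                (
                    cosine_similarity(candidate_vectors[i], candidate_vectors[j])
                    for j in selected
                ),
                default=0.0,
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected
```

Because this loop is pure computation with no I/O to await, running it inside an event loop would block other tasks, which is consistent with keeping the MMR function sync only.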

Member

@pquentin pquentin left a comment

Thanks for this draft! I've reviewed with an eye towards the big things that could be improved to make this feel native in the Python client. I've not looked at the detail of the code, since I don't expect to bring value here.

The big question is going to be the name of this module and its location in the client.

  • I don't think store works here: we have to communicate that this is an LLM/GenAI thing. I would thus call it vectorstore or vector_store. (Indeed, according to PEP 8, underscores can be used in the module name if it improves readability.)
  • Currently, all the manually written code is in elasticsearch.helpers which felt like the natural location for this new code initially. But I'm less sure now, given existing helpers are simply functions and this is a lot more code with a lot more state. So I'm tempted to think that elasticsearch.vector_store is a better location than elasticsearch.helpers.vector_store.

@maxjakob
Contributor Author

It was called vectorstore for the longest time :) I renamed it because of the BM25 capabilities. But you're right, it's more appealing/familiar to people with vector in the name.

@maxjakob
Contributor Author

  • Currently, all the manually written code is in elasticsearch.helpers which felt like the natural location for this new code initially. But I'm less sure now, given existing helpers are simply functions and this is a lot more code with a lot more state. So I'm tempted to think that elasticsearch.vector_store is a better location than elasticsearch.helpers.vector_store.

I share this sentiment (but don't have a strong opinion).

@ezimuel
Contributor

ezimuel commented Apr 18, 2024

I think vectorstore (or vector_store) is a better name, since it emphasizes the semantic search feature.

  • Currently, all the manually written code is in elasticsearch.helpers which felt like the natural location for this new code initially. But I'm less sure now, given existing helpers are simply functions and this is a lot more code with a lot more state. So I'm tempted to think that elasticsearch.vector_store is a better location than elasticsearch.helpers.vector_store.

Since this code consumes the client as an additional layer, I think putting it into the elasticsearch.helpers namespace is the perfect fit. The helpers should collect all the high-level features that facilitate the usage of the Elasticsearch endpoints.

@maxjakob maxjakob changed the title from "ElasticsearchStore" to "VectorStore for GenAI integrations" Apr 18, 2024
Member

@pquentin pquentin left a comment

Reviewed two small things, but please let me know when you need a more complete review.

@maxjakob maxjakob force-pushed the genai-orchestration branch 2 times, most recently from d4a84c1 to 543b49f Compare April 22, 2024 14:32
@maxjakob maxjakob force-pushed the genai-orchestration branch from 543b49f to d397982 Compare April 22, 2024 15:02
@maxjakob maxjakob requested a review from pquentin April 22, 2024 15:20
@maxjakob
Contributor Author

maxjakob commented Apr 22, 2024

@pquentin Can I get your eyes on this again? 👀 Some questions from my side:

  1. Do we agree on the general interfaces?
  2. Can you give me some pointers on how to best set up vectorstore as an optional dependency? EDIT: ✅
  3. Do we need to do anything special for the -serverless package? EDIT: ✅
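One common way to handle the optional-dependency question above is an extras group in the packaging metadata plus a guarded import at the point of use. The sketch below shows that general pattern only. The extra name `vectorstore_mmr`, the package list and the helper `import_optional` are assumptions for illustration, not what the PR settled on.

```python
import importlib
from types import ModuleType

# Hypothetical extras mapping, in the form it would be passed to
# setuptools as `extras_require` in setup.py.
EXTRAS_REQUIRE = {"vectorstore_mmr": ["numpy"]}


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature; "
            f"install it with: pip install 'elasticsearch[{extra}]'"
        ) from exc
```

Calling the helper only inside the function that needs the dependency keeps a plain install importable, and users who reach the optional feature get a message that tells them which extra to install.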

@maxjakob
Contributor Author

Adding the new classes to the documentation (this may require reformatting the docstrings to follow the sphinx format, as I noted in the code)

👍 reformatting...

Async fixtures and tests

I will do this in a follow-up PR because, to do it right, I would like to create two directories, _async and _sync, like in the application code. That would involve moving existing tests, and I would like to keep that out of this PR.

- Strategy suffix
- Sphinx docstrings
- add user agent to EmbeddingService
- raise ConflictError
- various cleanup
Member

@pquentin pquentin left a comment

Thanks! We looked at this together and found a few really minor formatting things to change. I'll test that quickly and then we can merge! 🎉

raise ValueError("specify a query_vector")

if self.distance is DistanceMetric.COSINE:
similarityAlgo = (
Member

nit: We could consider unit testing all those combinations with ReferenceJson and help from GitHub Copilot to generate the tests. But then up to you as this is also quite simple and the tests are going to look a lot like the original code anyway.

Maybe focus on the raised exceptions, using code like with pytest.raises(match="specify a query_vector"): ....

Contributor Author

Testing the error states now. Assertions for the ES queries are part of the integration tests.
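The error-state test discussed in this thread could look roughly like the sketch below. `knn_query_sketch` is a hypothetical stand-in for a strategy's query builder, not the function under review. Only the error message is taken from the snippet above. Note that `pytest.raises` is given the expected exception type explicitly here, with `match` applied as a regular-expression search against the message.

```python
from typing import Any, Dict, List, Optional

import pytest


def knn_query_sketch(query_vector: Optional[List[float]], k: int) -> Dict[str, Any]:
    """Hypothetical stand-in for a strategy's query builder."""
    if query_vector is None:
        raise ValueError("specify a query_vector")
    return {"knn": {"field": "vector_field", "query_vector": query_vector, "k": k}}


def test_missing_query_vector_raises() -> None:
    # The test only checks the error path; the shape of the generated
    # query is left to the integration tests.
    with pytest.raises(ValueError, match="specify a query_vector"):
        knn_query_sketch(query_vector=None, k=2)
```

Keeping unit tests to the raised exceptions avoids re-stating the query-building code in the test, which was the concern raised in the review comment.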

@maxjakob maxjakob force-pushed the genai-orchestration branch from 4f28761 to f32ceb2 Compare April 29, 2024 12:18
@maxjakob maxjakob requested review from ezimuel and pquentin and removed request for ezimuel April 29, 2024 12:19
@maxjakob maxjakob force-pushed the genai-orchestration branch from e00d182 to d27f9f8 Compare April 30, 2024 12:49
Member

@pquentin pquentin left a comment

Thanks! LGTM.

@maxjakob maxjakob merged commit c2b0ca3 into main Apr 30, 2024
18 checks passed
@maxjakob maxjakob deleted the genai-orchestration branch April 30, 2024 13:19
github-actions bot pushed a commit that referenced this pull request May 2, 2024
* ElasticsearchStore

* Update elasticsearch/store/_utilities.py

Co-authored-by: Quentin Pradet <[email protected]>

* rename; depend on client; async only

* generate _sync files

* add cleanup step for _sync generation

* fix formatting

* more linting fixes

* batch embedding call; infer num_dimensions

* revert accidental changes

* keep field names only in store; apply metadata mappings in store

* fix typos in file names

* use `elasticsearch_url` fixture; create conftest.py

* export relevant classes

* remove Semantic strategy

wait for `semantic_text` to land

* es_query is sync

* async strategies

* cleanup old file

* add docker-compose service with model deployment

* optional dependencies for MMR

* only test sync parts

* cleanup unasync script

* nox: install optional deps

* fix tests with requests remembering Transport

* fix numpy typing

* add user agent default argument

* move to `elasticsearch.helpers.vectorstore`

* use Protocol over ABC

* revert Protocol change because Python 3.7

* address PR feedback:

- Strategy suffix
- Sphinx docstrings
- add user agent to EmbeddingService
- raise ConflictError
- various cleanup

* improve docstring

* fix metadata mappings issue

* address PR feedback

* add error tests for strategies

* canonical names, keyword args only

* fix sparse vector strategy bug (duplicate `size`)

* all wildcard deletes in compose ES

---------

Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit c2b0ca3)
pquentin pushed a commit that referenced this pull request May 2, 2024
Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit c2b0ca3)

Co-authored-by: Max Jakob <[email protected]>
4 participants