feat: support specifying an encoder via unstructured-ingest #1782

Closed
ryannikolaidis opened this issue Oct 18, 2023 · 0 comments · Fixed by #1918
Labels
enhancement New feature or request

Comments

@ryannikolaidis
Contributor

Is your feature request related to a problem? Please describe.
Currently we only support an OpenAI encoder for fetching embeddings for a given set of results (a list of Elements). By extension, when a user sets the embedding-api-key flag, we assume they are providing an OpenAI key and want to use that encoder. After #1738 and #1619 merge, we will have two additional encoding options, but no way to use them through unstructured-ingest.

Describe the solution you'd like
The unstructured-ingest CLI and Runners should support an option that lets the user specify which encoder to use when creating embeddings for their results.
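
As a rough illustration of the request, here is a minimal sketch of a CLI option that picks an encoder from a registry. Everything in it is an assumption made for illustration: the `--embedding-provider` flag name, the `EncoderConfig` class, the provider keys, and the stub encoders are hypothetical and are not the unstructured-ingest API. Only the `embedding-api-key` flag name comes from the description above.

```python
import argparse
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Vector = List[float]


@dataclass
class EncoderConfig:
    """Hypothetical container for the user's encoder choice."""
    provider: str
    api_key: Optional[str] = None


def _stub_openai(texts: List[str], config: EncoderConfig) -> List[Vector]:
    # Stand-in for a hosted encoder: it only checks that a key was supplied.
    if not config.api_key:
        raise ValueError("the 'openai' provider needs --embedding-api-key")
    return [[float(len(t))] for t in texts]


def _stub_local(texts: List[str], config: EncoderConfig) -> List[Vector]:
    # Stand-in for a locally run encoder: no key required.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# Registry mapping a provider name to its encoder (names are assumptions).
ENCODERS: Dict[str, Callable[[List[str], EncoderConfig], List[Vector]]] = {
    "openai": _stub_openai,
    "local": _stub_local,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest-sketch")
    # Hypothetical flag; defaulting to "openai" keeps the current behavior.
    parser.add_argument(
        "--embedding-provider", choices=sorted(ENCODERS), default="openai"
    )
    parser.add_argument("--embedding-api-key", default=None)
    return parser


def main(argv: List[str]) -> List[Vector]:
    args = build_parser().parse_args(argv)
    config = EncoderConfig(
        provider=args.embedding_provider, api_key=args.embedding_api_key
    )
    return ENCODERS[config.provider](["hello world"], config)
```

Defaulting the provider to the OpenAI entry means existing invocations that pass only a key would behave as before, while a key-less local provider becomes selectable.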

Describe alternatives you've considered
NA

Additional context
NA

ryannikolaidis added the enhancement (New feature or request) and ingest labels Oct 18, 2023
github-merge-queue bot pushed a commit that referenced this issue Nov 6, 2023
…d deterministic ingest test for embeddings (#1918)

Closes #1782 

This PR:
- Extends the ingest pipeline so that an embedding provider can be
selected from a range of providers
- Changes the ingest embedding test into a diff test, since the
embedding vectors are reproducible now that multiple providers are
supported

Additional info on the provider chosen for the test:
- `langchain.embeddings.HuggingFaceEmbeddings` was found to be
deterministic even when no seed is set
- A unit test with this provider took 6.84s to pass (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, so it
costs nothing

For all these reasons, testing the embedding modules with the
Hugging Face model makes sense
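
The idea of a diff test over reproducible embeddings can be sketched as follows. This is not the project's actual test: `toy_embed` is a hypothetical hash-based stand-in used only because it is deterministic without any model download, and the fixture format is an assumption.

```python
import hashlib
import json
from typing import Dict, List


def toy_embed(text: str, dim: int = 4) -> List[float]:
    # Hypothetical deterministic "embedding": scaled bytes of a SHA-256 digest.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [round(b / 255, 6) for b in digest[:dim]]


def embed_elements(texts: List[str]) -> List[Dict]:
    # Mimics pipeline output: one record per element, with its vector attached.
    return [{"text": t, "embeddings": toy_embed(t)} for t in texts]


def diff_against_fixture(output: List[Dict], fixture_json: str) -> List[str]:
    # Compare fresh output with a stored expected-output fixture.
    expected = json.loads(fixture_json)
    mismatches = []
    if len(output) != len(expected):
        mismatches.append(f"expected {len(expected)} elements, got {len(output)}")
    for i, (got, want) in enumerate(zip(output, expected)):
        if got != want:
            mismatches.append(f"element {i} differs")
    return mismatches


texts = ["first element", "second element"]
# In a real setup the fixture would be a checked-in file; here it is built inline.
fixture = json.dumps(embed_elements(texts))
```

A diff test of this shape only works when the encoder returns identical vectors on every run, which is why a deterministic provider matters for it.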

---------

Co-authored-by: cragwolfe <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: ahmetmeleq <[email protected]>
github-project-automation bot moved this from Todo to Done in unstructured-ingest Nov 6, 2023
shreyanid pushed a commit that referenced this issue Nov 6, 2023
…d deterministic ingest test for embeddings (#1918)
