feat: add pinecone destination connector #1774
Merged
Changes from 115 commits
0389b52
add index creation script
ahmetmeleq f5fe2a2
rebase off main for the changes in ingest cli
ahmetmeleq 655ffb6
trials on bugfix
ahmetmeleq 33f5054
fix dependency name
ahmetmeleq 473f73c
apply roman's updates to pinecone
ahmetmeleq 0ee9e6b
trials on pinecone example
ahmetmeleq c6b1dc5
serially batched upsert with embeddings issue workaround
ahmetmeleq 4bb0b1b
parallelized upsert with session handles
ahmetmeleq 2239494
skip chunking to avoid missing embeddings, remove zipping (another wo…
ahmetmeleq d781698
fix for logging error
ahmetmeleq 6a96193
alphabetic order setup.py
ahmetmeleq 0289dcc
add docs
ahmetmeleq 4bec171
docs
ahmetmeleq 10fb5e7
docs
ahmetmeleq 5a3975c
rearrange imports
ahmetmeleq 397328f
add dependencies
ahmetmeleq 1f6aacb
update example
ahmetmeleq ab73a49
add tests
ahmetmeleq d040010
add pinecone ingest test
ahmetmeleq 03b32bc
obfuscate embedding api keys
ahmetmeleq 9895077
update pinecone cli based on the new cli rebase
ahmetmeleq 9b3096e
shellcheck
ahmetmeleq 099fc4f
changelog and version
ahmetmeleq d1b1045
linting
ahmetmeleq 4fb14b0
linting
ahmetmeleq 5c33688
linting
ahmetmeleq b1069d4
fix chunking node logs
ahmetmeleq 67ccfaf
remove redundant secret from test fixtures update pr job
ahmetmeleq 9dbad76
remove redundant helper script
ahmetmeleq 2470dc7
remove redundant comments in test
ahmetmeleq 2e4dda2
update example
ahmetmeleq ae8598e
fix log in pipeline embedding node
ahmetmeleq 15d2459
change pinecone batching size
ahmetmeleq 0c28c17
add debugging tip
ahmetmeleq 5f39a64
update ingest test with chunking
ahmetmeleq f307a0a
update example with chunking
ahmetmeleq daeecf9
organize requirements
ahmetmeleq 134a8bf
update expected uploads based on the updates in main
ahmetmeleq 9c12d3b
session handle fix
ahmetmeleq 6d84efc
doc, comment and logging updates
ahmetmeleq 84e65e5
test and session creation updates
ahmetmeleq 8d7612e
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 5a438b8
update for cli changes
ahmetmeleq cf77315
do not exclude metadata
ahmetmeleq 17c724b
multiple attempts for testing
ahmetmeleq 0257769
fix path typos on setup.py
ahmetmeleq fec2263
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 830e387
reorder test, update path in test
ahmetmeleq 5ef236f
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq ca94947
setup py changes from main
ahmetmeleq 791bf03
ingest test uses huggingface embedder
ahmetmeleq f434c3d
remove comment
ahmetmeleq b7345c7
add secret to test_ingest_dest job
ahmetmeleq 766d485
make batch size a parameter
ahmetmeleq 69e1949
bugfix on chunking params and implementing related test
ahmetmeleq 3fd8c62
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq f2786ce
pass metadata fields individually
ahmetmeleq d1b1cd2
Merge branch 'ahmet/pinecone-connector' of https://github.com/Unstruc…
ahmetmeleq c945851
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 4b49fbd
implement check_connection
ahmetmeleq 1ff1fd6
expose writer num_processes, apply parallelization in ingest test
ahmetmeleq 007ad36
fix session handles
ahmetmeleq b4b858a
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 3d81cfd
logging updates
ahmetmeleq fac751e
changelog and version
ahmetmeleq 35be64a
random index names to avoid test run collisions
ahmetmeleq 0440eb2
re-add --chunk-new-after-n-chars
ahmetmeleq 1e8f34e
add support for new_after_n_chars
ahmetmeleq e700a75
check existence of num_processes (dest) when logging
ahmetmeleq 80fed3b
update docs
ahmetmeleq 00b123e
update example and docs
ahmetmeleq 1ead5e0
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq bac73f0
changelog
ahmetmeleq 1e6ff4c
fix typo in example
ahmetmeleq c44ab12
index creation retry logic for when another index is being deleted in…
ahmetmeleq 94e66b3
index creation retry logic for when another index is being deleted in…
ahmetmeleq be87dd4
Merge branch 'ahmet/pinecone-connector' of https://github.com/Unstruc…
ahmetmeleq 567ed4e
Merge branch 'ahmet/pinecone-connector' of https://github.com/Unstruc…
ahmetmeleq 074c1ca
Merge branch 'ahmet/pinecone-connector' of https://github.com/Unstruc…
ahmetmeleq 9adefb7
update project variables, update sleep amounts
ahmetmeleq 65fce1c
update docs
ahmetmeleq 3a16a08
update docs
ahmetmeleq 362eb81
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq c3266f0
update docs
ahmetmeleq 8a6a0cb
Merge branch 'ahmet/pinecone-connector' of https://github.com/Unstruc…
ahmetmeleq 387a7ad
remove download_dir, remove index creation loop
ahmetmeleq f884123
update example
ahmetmeleq cbd734f
pythonic approach in docs
ahmetmeleq fa083ff
update log
ahmetmeleq 29758f8
move upsert method
ahmetmeleq 14d4e51
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 5723467
shellcheck
ahmetmeleq 992b60d
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 4939113
Update docs/source/ingest/destination_connectors/pinecone.rst
ahmetmeleq 9812d93
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq b6e9773
version
ahmetmeleq 4f65b49
s3 docs pythonic approach and local connector
ahmetmeleq 738d75c
add comment on why we use random rather than uuidgen
ahmetmeleq fe818e4
check if test variables are defined before setting
ahmetmeleq 937bdfa
shellcheck double quotes
ahmetmeleq a2b2fc3
update parent classes for cliconfig
ahmetmeleq 940f72d
different number of processes for processor and writer in test
ahmetmeleq e51b88f
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq 35701dc
add comment, add field selection from element, add list items separat…
ahmetmeleq 1278379
walrus syntax := instead of if [-z $...] for default parameters
ahmetmeleq 0ec7cae
better type checking for session handles
ahmetmeleq 7b9e02b
implement check_connection
ahmetmeleq 1cf1290
move log for number of (upload) processes from pipeline to connector
ahmetmeleq ca0785e
update embedding docs to have embedding prepend for cli args
ahmetmeleq 83518b0
add potter's flatten lists to flatten dicts
ahmetmeleq e1a6365
make all element fields indexable, add element_serialized
ahmetmeleq f8688e5
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq e353c6b
unique ids for pinecone entries rather than using element ids
ahmetmeleq 7e0c7e7
Merge branch 'ahmet/pinecone-connector' of https://github.com/Unstruc…
ahmetmeleq b54e5ce
an additional error wrapper for check connection
ahmetmeleq 07dfbd8
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq cebfcb6
changelog and version
ahmetmeleq 5cfef3c
Merge branch 'main' into ahmet/pinecone-connector
ahmetmeleq
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
Pinecone | ||
=========== | ||
|
||
Batch process all your records using ``unstructured-ingest`` to store structured outputs and embeddings locally on your filesystem and upload those to a Pinecone index. | ||
|
||
First you'll need to install the Pinecone dependencies as shown here. | ||
|
||
.. code:: shell | ||
|
||
pip install "unstructured[pinecone]" | ||
|
||
Run Locally | ||
----------- | ||
The upstream connector can be any of the supported connectors; for convenience, the sample command below uses the | ||
upstream local connector. Running it creates new files on your local filesystem. | ||
|
||
.. tabs:: | ||
|
||
.. tab:: Shell | ||
|
||
.. code:: shell | ||
|
||
unstructured-ingest \ | ||
local \ | ||
--input-path example-docs/book-war-and-peace-1225p.txt \ | ||
--output-dir local-to-pinecone \ | ||
--strategy fast \ | ||
--chunk-elements \ | ||
--embedding-provider <an unstructured embedding provider, ie. langchain-huggingface> \ | ||
--num-processes 2 \ | ||
--verbose \ | ||
--work-dir "<directory for intermediate outputs to be saved>" \ | ||
pinecone \ | ||
--api-key <your pinecone api key here> \ | ||
--index-name <your index name here, ie. ingest-test> \ | ||
--environment <your environment name here, ie. gcp-starter> \ | ||
--batch-size <number of elements to be uploaded per batch, ie. 80> \ | ||
--num-processes <number of processes to be used to upload, ie. 2> | ||
|
||
.. tab:: Python | ||
|
||
.. code:: python | ||
|
||
import os | ||
|
||
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig, ChunkingConfig, EmbeddingConfig | ||
from unstructured.ingest.runner import LocalRunner | ||
if __name__ == "__main__": | ||
    runner = LocalRunner( | ||
        processor_config=ProcessorConfig( | ||
            verbose=True, | ||
            output_dir="local-output-to-pinecone", | ||
            num_processes=2, | ||
        ), | ||
        read_config=ReadConfig(), | ||
        partition_config=PartitionConfig(), | ||
        chunking_config=ChunkingConfig( | ||
            chunk_elements=True | ||
        ), | ||
        embedding_config=EmbeddingConfig( | ||
            provider="langchain-huggingface", | ||
        ), | ||
        writer_type="pinecone", | ||
        writer_kwargs={ | ||
            "api_key": os.getenv("PINECONE_API_KEY"), | ||
            "index_name": os.getenv("PINECONE_INDEX_NAME"), | ||
            "environment": os.getenv("PINECONE_ENVIRONMENT_NAME"), | ||
            "batch_size": 80, | ||
            "num_processes": 2, | ||
        } | ||
    ) | ||
    runner.run( | ||
        input_path="example-docs/fake-memo.pdf", | ||
    ) | ||
|
||
|
||
For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> pinecone --help``. | ||
|
||
NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_. |
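To make the ``--batch-size`` option more concrete, here is a minimal sketch of how a list of elements could be split into upload batches. The `batched` helper and the sample `elements` list are hypothetical and only illustrate the idea; this is not taken from the connector's implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # Keep pulling slices until the iterator is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Illustrative data only: 200 placeholder elements, uploaded 80 at a time.
elements = [{"id": str(i)} for i in range(200)]
batch_sizes = [len(batch) for batch in batched(elements, batch_size=80)]
print(batch_sizes)  # [80, 80, 40]
```

Under this reading, each batch would correspond to one upload request, and the writer's ``--num-processes`` option would control how many batches are sent concurrently.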
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
S3 | ||
=========== | ||
|
||
Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to an S3 bucket. | ||
|
||
First you'll need to install the S3 dependencies as shown here. | ||
|
||
.. code:: shell | ||
|
||
pip install "unstructured[s3]" | ||
|
||
Run Locally | ||
----------- | ||
The upstream connector can be any of the supported connectors; for convenience, the sample command below uses the | ||
upstream local connector. Running it creates new files on your local filesystem. | ||
|
||
.. tabs:: | ||
|
||
.. tab:: Shell | ||
|
||
.. code:: shell | ||
|
||
unstructured-ingest \ | ||
local \ | ||
--input-path example-docs/book-war-and-peace-1225p.txt \ | ||
--output-dir local-to-s3 \ | ||
--strategy fast \ | ||
--chunk-elements \ | ||
--embedding-provider <an unstructured embedding provider, ie. langchain-huggingface> \ | ||
--num-processes 2 \ | ||
--verbose \ | ||
--work-dir "<directory for intermediate outputs to be saved>" \ | ||
s3 \ | ||
--anonymous \ | ||
--remote-url "<your destination path here, ie 's3://unstructured/war-and-peace-output'>" | ||
|
||
.. tab:: Python | ||
|
||
.. code:: python | ||
|
||
import os | ||
|
||
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig, ChunkingConfig, EmbeddingConfig | ||
from unstructured.ingest.runner import LocalRunner | ||
if __name__ == "__main__": | ||
    runner = LocalRunner( | ||
        processor_config=ProcessorConfig( | ||
            verbose=True, | ||
            output_dir="local-output-to-s3", | ||
            num_processes=2, | ||
        ), | ||
        read_config=ReadConfig(), | ||
        partition_config=PartitionConfig(), | ||
        chunking_config=ChunkingConfig( | ||
            chunk_elements=True | ||
        ), | ||
        embedding_config=EmbeddingConfig( | ||
            provider="langchain-huggingface", | ||
        ), | ||
        writer_type="s3", | ||
        writer_kwargs={ | ||
            "anonymous": True, | ||
            "remote_url": "<your destination path here, ie 's3://unstructured/war-and-peace-output'>", | ||
        } | ||
    ) | ||
    runner.run( | ||
        input_path="example-docs/book-war-and-peace-1225p.txt", | ||
    ) | ||
|
||
|
||
For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> s3 --help``. | ||
|
||
NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_. |
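The Shell and Python tabs pass the same options under different spellings. As a rough sketch, assuming the writer keyword names follow the usual convention of dropping the leading dashes and replacing hyphens with underscores (as with ``--batch-size`` and ``batch_size`` in the Pinecone example), a CLI flag could be translated like this. The `flag_to_kwarg` helper is hypothetical and not part of `unstructured`.

```python
def flag_to_kwarg(flag: str) -> str:
    """Turn a CLI flag such as '--remote-url' into a keyword name (assumed convention)."""
    return flag.lstrip("-").replace("-", "_")


# Flags taken from the shell example; the mapping itself is an assumption.
for flag in ("--anonymous", "--remote-url"):
    print(flag, "->", flag_to_kwarg(flag))
```

Checking the keys of `writer_kwargs` against this convention is a quick way to catch a flag name pasted into Python code by mistake.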
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
#!/usr/bin/env bash | ||
|
||
# Processes example-docs/book-war-and-peace-1225p.txt with the local source connector, | ||
# embeds the processed document, and writes the results to a Pinecone index. | ||
|
||
# Structured outputs are stored in local-to-pinecone/ | ||
|
||
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) | ||
cd "$SCRIPT_DIR"/../../.. || exit 1 | ||
|
||
|
||
# As an example we're using the local source connector, | ||
# however ingesting from any supported source connector is possible. | ||
# shellcheck disable=2094 | ||
PYTHONPATH=. ./unstructured/ingest/main.py \ | ||
local \ | ||
--input-path example-docs/book-war-and-peace-1225p.txt \ | ||
--output-dir local-to-pinecone \ | ||
--strategy fast \ | ||
--chunk-elements \ | ||
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \ | ||
--num-processes 2 \ | ||
--verbose \ | ||
--work-dir "<directory for intermediate outputs to be saved>" \ | ||
pinecone \ | ||
--api-key "<Pinecone API Key to write into a Pinecone index>" \ | ||
--index-name "<Pinecone index name, ie: ingest-test>" \ | ||
--environment "<Pinecone environment name, ie: gcp-starter>" \ | ||
--batch-size "<Number of elements to be uploaded per batch, ie. 80>" \ | ||
--num-processes "<Number of processes to be used to upload, ie. 2>" |
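The placeholders in the script above have to be replaced with real values before it can run. A minimal sketch of collecting them from environment variables and failing early when one is missing follows; it reuses the variable names from the Python documentation example (`PINECONE_API_KEY`, `PINECONE_INDEX_NAME`, `PINECONE_ENVIRONMENT_NAME`), while the `load_pinecone_settings` helper itself is hypothetical.

```python
import os
from typing import Dict, Mapping

# Environment variable name -> writer keyword, as used in the docs example.
REQUIRED_VARIABLES = {
    "PINECONE_API_KEY": "api_key",
    "PINECONE_INDEX_NAME": "index_name",
    "PINECONE_ENVIRONMENT_NAME": "environment",
}


def load_pinecone_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return writer keyword arguments, raising if any variable is unset or empty."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for name, kwarg in REQUIRED_VARIABLES.items()}


# Usage with an explicit mapping, so the example does not depend on the real environment.
example_env = {
    "PINECONE_API_KEY": "not-a-real-key",
    "PINECONE_INDEX_NAME": "ingest-test",
    "PINECONE_ENVIRONMENT_NAME": "gcp-starter",
}
print(load_pinecone_settings(example_env))
```

Failing before the pipeline starts avoids partitioning and embedding a large document only to have the upload step reject an empty API key.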
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
-c constraints.in | ||
-c base.txt | ||
pinecone-client |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# | ||
# This file is autogenerated by pip-compile with Python 3.10 | ||
# by the following command: | ||
# | ||
# pip-compile requirements/ingest-pinecone.in | ||
# | ||
certifi==2023.7.22 | ||
# via | ||
# -c requirements/base.txt | ||
# -c requirements/constraints.in | ||
# requests | ||
charset-normalizer==3.3.0 | ||
# via | ||
# -c requirements/base.txt | ||
# requests | ||
dnspython==2.4.2 | ||
# via pinecone-client | ||
idna==3.4 | ||
# via | ||
# -c requirements/base.txt | ||
# requests | ||
loguru==0.7.2 | ||
# via pinecone-client | ||
numpy==1.24.4 | ||
# via | ||
# -c requirements/base.txt | ||
# -c requirements/constraints.in | ||
# pinecone-client | ||
pinecone-client==2.2.4 | ||
# via -r requirements/ingest-pinecone.in | ||
python-dateutil==2.8.2 | ||
# via pinecone-client | ||
pyyaml==6.0.1 | ||
# via pinecone-client | ||
requests==2.31.0 | ||
# via | ||
# -c requirements/base.txt | ||
# pinecone-client | ||
six==1.16.0 | ||
# via | ||
# -c requirements/base.txt | ||
# python-dateutil | ||
tqdm==4.66.1 | ||
# via | ||
# -c requirements/base.txt | ||
# pinecone-client | ||
typing-extensions==4.8.0 | ||
# via | ||
# -c requirements/base.txt | ||
# pinecone-client | ||
urllib3==1.26.18 | ||
# via | ||
# -c requirements/base.txt | ||
# -c requirements/constraints.in | ||
# pinecone-client | ||
# requests |
This should leverage the same `num_processes` being set in the `ProcessorConfig`. Actually not sure if this is causing a duplicate key error in the CLI itself...
Assuming it's the same concern, check #1774 (comment)
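One way to read the suggestion in this thread, as a hypothetical sketch only: the writer could fall back to the processor-level `num_processes` unless the caller sets its own value. The `resolve_writer_kwargs` function below is illustrative and is not code from this PR.

```python
from typing import Any, Dict, Optional


def resolve_writer_kwargs(
    processor_num_processes: int,
    writer_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Copy writer_kwargs, defaulting num_processes to the processor-level value."""
    resolved = dict(writer_kwargs or {})
    # An explicit writer-level value wins; otherwise reuse the processor setting.
    resolved.setdefault("num_processes", processor_num_processes)
    return resolved


print(resolve_writer_kwargs(2, {"batch_size": 80}))
print(resolve_writer_kwargs(2, {"batch_size": 80, "num_processes": 4}))
```

Keeping both values in one dictionary under a single key also sidesteps the duplicate-key question raised above, because only one `num_processes` entry ever reaches the writer.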