Implementation of hybrid search chunking strategy for Pinecone, extra-metadata fields for chunks #224
Open
jaisir-shadai wants to merge 43 commits into Unstructured-IO:main from shadai-group:main
Conversation
…id-search-and-extra-metadata Implementation of pinecone hybrid search and extra metadata allowed i…
…om-metadata reading of orig elements
…om-metadata version updated
…om-metadata local function to decode implemented
…om-metadata self added
…om-metadata add json load to reading orig elements
…om-metadata metadata fields added and bedrock model name included in call
…om-metadata default model usage to cohere and region to virginia
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: mackurzawa <[email protected]>
Co-authored-by: Rob Roskam <[email protected]>
Co-authored-by: hubert.rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>
* split sql into distinct connectors
* migrate sqlite tests from e2e to integration
* migrate postgres tests from e2e to integration
* bump to first minor version due to breaking change
* revert changelog version to a dev one
* Add comments to test files

…tured-IO#183)
* fix how models are mapped to flat data to support optional access configs
* make sure connection config access config check passes if it's optional

* split databricks into each auth type supported
* migrate volumes source connector e2e test
* migrate volumes destination connector e2e test
* bump changelog
* Add databricks secrets to int test CI
* Add s3 secrets to destination CI
* expose azure databricks connector
* add missing connector type definition
* update changelog to be a minor bump

* update kdbai to latest version
* sync changelog

* implement sqlite version of sql indexer and downloader
* add integration test for sqlite source connector
* Drop ids from copied filedata
* migrate env setup over to integration test folder
* add postgres source connector with tests
* bump changelog

Migrated Slack Source Connector to V2
Co-authored-by: Filip Knefel <[email protected]>

* feat add unfinished first version
* fix add pass
* update requirements
* delta table connector for s3
* fix add AWS REGION
* fix add new test to test-dest.sh
* code more readable
* remove unused import
* address feedback
* lint
* change file extension in stager
* add comment for locks in S3
* remove mode, engine, schema_mode from configs
* allow for dynamic stagers
* add databricks volumes src to __init__ - automatic merge didn't add it
* delete pandas from delta-table requirements
* add integration test for delta-table-s3
* consider aws_region as non-sensitive parameter
* provide better description for table_uri
* change import location, remove requires_dependencies before precheck for uploader
* use fsspec to clean up after integration test
* bump version
* bump changelog version, implement uploader precheck, change bash test s3 destination folder
* change fsspec import location
* create update_storage_options method
* upload Delta tables to folders with adequate names
* linter
* modify local e2e test to leverage new naming convention
* Add type annotations to update_storage_options method (Co-authored-by: Roman Isecke <[email protected]>)
* change uploader precheck to write empty file to s3
* reformat blank lines
* Delta Table local test leveraging new testing framework
* deleted test_e2e/dest/delta-table-s3.sh, test_e2e/dest/delta-table.sh, test_e2e/python/test-ingest-delta-table-output.py, and updated test_e2e/test-dest.sh
Co-authored-by: Hubert Rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* Add source connector with test using localstack
* Add destination connector with test using localstack
* bump changelog
* remove hard coded token
* Add support to check the contents of downloaded files in integration tests
* use dataframe based equality for csv files
* Add additional printing for debugging
* populate sql db with deterministic content
* update sql tests
* Support async indexing in pipeline
* bump changelog
* fix asyncgenerator typing

* add singlestore source connector
* pull all needed docker images at the beginning of the CI job
* Add docker logs to error output
* delete images after docker compose
* shell tidy
* bump the github runner for src/dest integration tests
* fix minio docker compose path
* don't prematurely pull all images
* Fix changelog typo
* replace deprecated __fields__ with model_fields()

* initial commit
* connector update
* comment out extension
* update expected tests
* update expected
* astradb source connector updates
* fix downloader
* update fixtures
* bump dev version
* nit
* cleanup
* address comments
* async downloader
* update uploader
* fixes wip
* update response
* tidy
* make deepcopy of fd
* update doc type to file, not csv
Co-authored-by: Shreya Nidadavolu <[email protected]>
Co-authored-by: shreyanid <[email protected]>

…O#200)
* File system based indexers return record display name
* Update version and changelog
* Fix integration tests
* Fix integration tests
* return in dedicated FileData field
* Lint
* Fix changelog
* Use less specific name
* Fix postgres integration test
* fix singlestore integration test
* Fix sqlite integration test
* Set to release version

…structured-IO#168)
Migrate GitLab Source Connector to V2. Introduce `path` parameter which allows selecting a location in the repository to be processed. Fix logic of getting `base_url` from the full url to require both scheme and netloc to be present.
Co-authored-by: Filip Knefel <[email protected]>
Co-authored-by: Maciej Kurzawa <[email protected]>
Co-authored-by: Hubert Rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

…tructured-IO#213)
* default overwrite to true for connectors
* remove field
* fix
* fix
…nstructured-IO#188)
* bump unstructured-client version and leverage new async support
* Rename new config
* comment out astradb src test for now
* reenable astradb CI test
* fix azure src ingest test results
* drop use of unstructured repo call to api in v1

* Created confluence source v2 connector
* Fixed fields
* Correct secret pass
* Linter
* Fix parameter name
* Linter fix
* Parameter name fix
* Access config fix
* Updated precheck
* Linter fix
* Refactor name
* Removed unnecessary parameter
* Fixed FileData issue
* version bump
* Downloader fix
* Overwrite fixtures run
* Shfmt
* Revert changes
* Added source identifiers
* Linter fix
* Added integration test for confluence
* Linter fix
* Added necessary secrets
* Removed dataclass decorator from config
* Fixed input args
* Removed unnecessary dataclass from Downloader config
* Lint
* Arg name correct
* Intermediate with test data
* Added dir structure stub
* Added actual fixtures from overwrite
* Linter fix
* finish up confluence connector and update integration test
* add large test
* fix changelog
* regenerate fixtures
* remove confluence secrets from e2e ci job
* add html validation test
* comment out clarifai dest test
* tidy shell
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* Ignore vs stuff
* First iteration of OneDrive uploader
* Added integration test stub
* Adding test stubs
* Added changelog
* Added precheck to dest
* Added output testing
* Added working uploader and the integration test
* Removed old style test, corrected large files upload
* Black run
* Removed old checking file, deprecated by integration test
* Linter fixes; added upload validation
* Lint fix
* Revert test-dest.sh after removing old style test
* Env var pull test
* Added missing secrets
* fix test dest
* Fixed typo (Co-authored-by: Roman Isecke <[email protected]>)
* Moved fixtures to the testfile
* Added E2E test, added destination directory, adjusted original integration test
* Lint fix
* Shfmt
* Fixed one function
* Formatting issue
* One more linter
* update int test
* remove e2e test
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* qdrant v1, changed the typing import way
* stager part not finished yet
* remove migrated function
* conn not ready yet
* fix Roman PR comments
* make tidy
* taking optional out from access config
* make tidy
* add secret to access config
* taking api key to connection config
* fix collection name; back api key to access config
* version-secret
* docs list to element list
* printing writedict params
* changing variable name
* Refactor to async: refactor qdrant destination to use asynchronous SDK; deprecate --num-processes due to switch to async; update E2E test expectation to match V2 pipeline run; introduce docker based integration test
* Condense doc-string lines
* Update parameter descriptions
* Fix run_async signature
* Remove E2E test in favor of integration
* Capitalize description
* Fix collection_name calls and remove some incorrect Optionals: remove Optional from fields which do not take None values; fix referencing the moved collection_name parameter
* Test QdrantLocal, test embedding: test QdrantLocal in addition to Qdrant with docker server; test for embedding by querying single point for similarity
* Conform to 100 line limit
* Fix test name
* expand into different qdrant connectors per auth type
* add server integration test
Co-authored-by: Filip Knefel <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* feat: first iteration
* Update unstructured_ingest/v2/processes/connectors/kafka.py (Co-authored-by: Roman Isecke <[email protected]>)
* wip
* fix: addressing pr comments
* feat: bumping version
* feat: addressing e2e issues
* feat: connectors to review
* feat: updating kafka output
* feat: linter
* fix: linter
* feat: address
* feat: tidy
* feat: adding api-key
* feat: tidy
* purge: removing old code
* Roman's commit suggestion (Co-authored-by: Roman Isecke <[email protected]>)
* Added proper infinite loop limiters and proper reading of a preset number of messages
* Linter fix
* Added SourceIdentifiers use
* Linter fix
* Adjusted SourceIdentifiers, not ideal but will do
* Accidental copy-paste mistake correction
* Lint fix
* Fixture rename
* update changelog
* add int test, remove e2e test
* split kafka into local and cloud implementations
* fix txt validation and limit number of files to 5
* add back in kafka env setup for dest e2e test
* tidy
Co-authored-by: Roman Isecke <[email protected]>
Co-authored-by: mateuszkuprowski <[email protected]>
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* update changelog
* tidy
Update local fork
Would this be a replacement of the existing vector generated by the embedder step, or does Pinecone take in two different vectors to support hybrid search? I noticed in the PR that the embedding is now done inline with the upload, which we want to avoid. If needed, this might require a new embedder to be added and used as part of the pipeline.
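For context on the reviewer's question: Pinecone's hybrid search does not replace the dense vector, a single record carries both a dense embedding (`values`) and a sparse vector (`sparse_values`), so the embedder step's output can stay as-is and the sparse encoding is added alongside it. Below is a minimal sketch of that record shape; the function name and the toy term-frequency sparse encoding (a stand-in for BM25/SPLADE) are illustrative assumptions, not the PR's code.

```python
# Sketch: a Pinecone hybrid-search record carries BOTH a dense vector
# ("values") and a sparse vector ("sparse_values") -- hybrid support
# augments, rather than replaces, the embedder step's output.
from collections import Counter


def build_hybrid_record(record_id, dense_vector, text, vocab):
    """Combine a dense embedding with a toy term-frequency sparse encoding.

    `vocab` maps tokens to sparse-dimension indices; real pipelines would
    use a BM25/SPLADE-style encoder instead (hypothetical helper).
    """
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    indices = sorted(vocab[tok] for tok in counts)
    values = [float(counts[tok]) for tok in sorted(counts, key=lambda t: vocab[t])]
    return {
        "id": record_id,
        "values": dense_vector,  # dense embedding from the embedder step
        "sparse_values": {       # sparse vector for keyword-style matching
            "indices": indices,
            "values": values,
        },
    }


vocab = {"hybrid": 0, "search": 1, "pinecone": 2}
record = build_hybrid_record("chunk-0", [0.1, 0.2, 0.3], "Hybrid search in Pinecone", vocab)
```

A dict of this shape is what Pinecone's upsert accepts per record, which suggests the reviewer's preferred design (keep the embedder as a separate pipeline step and only attach `sparse_values` at upload time) is workable.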
This PR contains logic to use hybrid search in the Pinecone connector.
This solves #199.
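On the PR's other theme, extra-metadata fields for chunks: Pinecone metadata values are limited to strings, numbers, booleans, and lists of strings, so nested chunk metadata (such as the `orig_elements` field mentioned in the commits) typically has to be flattened or JSON-serialized before upload. A minimal sketch of that idea, with a hypothetical helper name and toy data (not the PR's actual code):

```python
# Sketch: flatten chunk metadata into Pinecone-compatible values,
# JSON-encoding anything nested (e.g. orig_elements) -- hypothetical helper.
import json

ALLOWED = (str, int, float, bool)


def flatten_chunk_metadata(element):
    """Keep scalar/list-of-string values as-is; JSON-encode the rest."""
    flat = {}
    for key, value in element.get("metadata", {}).items():
        if isinstance(value, ALLOWED):
            flat[key] = value
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            flat[key] = value
        else:
            flat[key] = json.dumps(value)  # nested structures become JSON strings
    return flat


chunk = {
    "metadata": {
        "filename": "report.pdf",
        "page_number": 3,
        "orig_elements": [{"type": "Title"}],
    }
}
meta = flatten_chunk_metadata(chunk)
```

The consumer side would then `json.loads` the serialized fields to recover the original elements, matching the "add json load to reading orig elements" commit above.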