Implementation of hybrid search chunking strategy for Pinecone, extra-metadata fields for chunks #224
Open
jaisir-shadai wants to merge 43 commits into Unstructured-IO:main from shadai-group:main
Conversation
…id-search-and-extra-metadata Implementation of pinecone hybrid search and extra metadata allowed i…
…om-metadata reading of orig elements
…om-metadata version updated
…om-metadata local function to decode implemented
…om-metadata self added
…om-metadata add json load to reading orig elements
…om-metadata metadata fields added and bedrock model name included in call
…om-metadata default model usage to cohere and region to virginia
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: mackurzawa <[email protected]>
Co-authored-by: Rob Roskam <[email protected]>
Co-authored-by: hubert.rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>
* split sql into distinct connectors
* migrate sqlite tests from e2e to integration
* migrate postgres tests from e2e to integration
* bump to first minor version due to breaking change
* revert changelog version to a dev one
* Add comments to test files

…tured-IO#183)
* fix how models are mapped to flat data to support optional access configs
* make sure connection config access config check passes if it's optional

* split databricks into each auth type supported
* migrate volumes source connector e2e test
* migrate volumes destination connector e2e test
* bump changelog
* Add databricks secrets to int test CI
* Add s3 secrets to destination CI
* expose azure databricks connector
* add missing connector type definition
* update changelog to be a minor bump

* update kdbai to latest version
* sync changelog

* implement sqlite version of sql indexer and downloader
* add integration test for sqlite source connector
* Drop ids from copied filedata
* migrate env setup over to integration test folder
* add postgres source connector with tests
* bump changelog

Migrated Slack Source Connector to V2
Co-authored-by: Filip Knefel <[email protected]>

* feat add unfinished first version
* fix add pass
* update requirements
* delta table connector for s3
* fix add AWS REGION
* fix add new test to test-dest.sh
* code more readable
* remove unused import
* address feedback
* lint
* change file extension in stager
* add comment for locks in S3
* remove mode, engine, schema_mode from configs
* allow for dynamic stagers
* add databricks volumes src to __init__ - automatic merge didn't add it
* delete pandas from delta-table requirements
* add integration test for delta-table-s3
* consider aws_region as non-sensitive parameter
* provide better description for table_uri
* change import location, remove requires_dependencies before precheck for uploader
* use fsspec to clean up after integration test
* bump version
* bump changelog version, implement uploader precheck, change bash test s3 destination folder
* change fsspec import location
* create update_storage_options method
* upload Delta tables to folders with adequate names
* linter
* modify local e2e test to leverage new naming convention
* Add type annotations to update_storage_options method (Co-authored-by: Roman Isecke <[email protected]>)
* change uploader precheck to write empty file to s3
* reformat blank lines
* Delta Table local test leveraging new testing framework
* deleted test_e2e/dest/delta-table-s3.sh, test_e2e/dest/delta-table.sh, test_e2e/python/test-ingest-delta-table-output.py, and updated test_e2e/test-dest.sh
Co-authored-by: Hubert Rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* Add source connector with test using localstack
* Add destination connector with test using localstack
* bump changelog
* remove hard coded token
* Add support to check the contents of downloaded files in integration tests
* use dataframe based equality for csv files
* Add additional printing for debugging
* populate sql db with deterministic content
* update sql tests
* Support async indexing in pipeline
* bump changelog
* fix asyncgenerator typing

* add singlestore source connector
* pull all needed docker images at the beginning of the CI job
* Add docker logs to error output
* delete images after docker compose
* shell tidy
* bump the github runner for src/dest integration tests
* fix minio docker compose path
* don't prematurely pull all images
* Fix changelog typo
* replace deprecated __fields__ with model_fields()

* initial commit
* connector update
* comment out extension
* update expected tests
* update expected
* astradb source connector updates
* fix downloader
* update fixtures
* bump dev version
* nit
* cleanup
* address comments
* async downloader
* update uploader
* fixes wip
* update response
* tidy
* make deepcopy of fd
* update doc type to file, not csv
Co-authored-by: Shreya Nidadavolu <[email protected]>
Co-authored-by: shreyanid <[email protected]>

…O#200)
* File system based indexers return record display name
* Update version and changelog
* Fix integration tests
* Fix integration tests
* return in dedicated FileData field
* Lint
* Fix changelog
* Use less specific name
* Fix postgres integration test
* fix singlestore integration test
* Fix sqlite integration test
* Set to release version

…structured-IO#168)
Migrate GitLab Source Connector to V2. Introduce `path` parameter which allows selecting a location in the repository to be processed. Fix logic of getting `base_url` from the full url to require both scheme and netloc to be present.
Co-authored-by: Filip Knefel <[email protected]>
Co-authored-by: Maciej Kurzawa <[email protected]>
Co-authored-by: Hubert Rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

…tructured-IO#213)
* default overwrite to true for connectors
* remove field
* fix
* fix
…nstructured-IO#188)
* bump unstructured-client version and leverage new async support
* Rename new config
* comment out astradb src test for now
* reenable astradb CI test
* fix azure src ingest test results
* drop use of unstructured repo call to api in v1

* Created confluence source v2 connector
* Fixed fields
* Correct secret pass
* Linter
* Fix parameter name
* Linter fix
* Parameter name fix
* Access config fix
* Updated precheck
* Linter fix
* Refactor name
* Removed unnecessary parameter
* Fixed FileData issue
* version bump
* Downloader fix
* Overwrite fixtures run
* Shfmt
* Revert changes
* Added source identifiers
* Linter fix
* Added integration test for confluence
* Linter fix
* Added necessary secrets
* Removed dataclass decorator from config
* Fixed input args
* Removed unnecessary dataclass from Downloader config
* Lint
* Arg name correct
* Intermediate with test data
* Added dir structure stub
* Added actual fixtures from overwrite
* Linter fix
* finish up confluence connector and update integration test
* add large test
* fix changelog
* regenerate fixtures
* remove confluence secrets from e2e ci job
* add html validation test
* comment out clarifai dest test
* tidy shell
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* Ignore vs stuff
* First iteration of OneDrive uploader
* Added integration test stub
* Adding test stubs
* Added changelog
* Added precheck to dest
* Added output testing
* Added working uploader and the integration test
* Removed old style test, corrected large files upload
* Black run
* Removed old checking file, deprecated by integration test
* Linter fixes; added upload validation
* Lint fix
* Revert test-dest.sh after removing old style test
* Env var pull test
* Added missing secrets
* fix test dest
* Fixed typo (Co-authored-by: Roman Isecke <[email protected]>)
* Moved fixtures to the testfile
* Added E2E test, added destination directory, adjusted original integration test
* Lint fix
* Shfmt
* Fixed one function
* Formatting issue
* One more linter
* update int test
* remove e2e test
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* qdrant v1, changed the typing import way
* stager part not finished yet
* remove migrated function
* conn not ready yet
* fix Roman PR comments
* make tidy
* taking optional out from access config
* make tidy
* add secret to access config
* taking api key to connection config
* fix collection name; back api key to access config
* version-secret
* docs list to element list
* printing writedict params
* changing variable name
* Refactor to async: refactor qdrant destination to use asynchronous SDK; deprecate --num-processes due to switch to async; update E2E test expectation to match V2 pipeline run; introduce docker based integration test
* Condense doc-string lines
* Update parameter descriptions
* Fix run_async signature
* Remove E2E test in favor of integration
* Capitalize description
* Fix collection_name calls and remove some incorrect Optionals: remove Optional from fields which do not take None values; fix referencing the moved collection_name parameter
* Test QdrantLocal, test embedding: test QdrantLocal in addition to Qdrant with docker server; test for embedding by querying single point for similarity
* Conform to 100 line limit
* Fix test name
* expand into different qdrant connectors per auth type
* add server integration test
Co-authored-by: Filip Knefel <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* feat: first iteration
* Update unstructured_ingest/v2/processes/connectors/kafka.py (Co-authored-by: Roman Isecke <[email protected]>)
* wip
* fix: addressing pr comments
* feat: bumping version
* feat: addressing e2e issues
* feat: connectors to review
* feat: updating kafka output
* feat: linter
* fix: linter
* feat: address
* feat: tidy
* feat: adding api-key
* feat: tidy
* purge: removing old code
* Roman's commit suggestion (Co-authored-by: Roman Isecke <[email protected]>)
* Added proper infinite loop limiters and proper reading of a preset number of messages
* Linter fix
* Added SourceIdentifiers use
* Linter fix
* Adjusted SourceIdentifiers, not ideal but will do
* Accidental copy-paste mistake correction
* Lint fix
* Fixture rename
* update changelog
* add int test, remove e2e test
* split kafka into local and cloud implementations
* fix txt validation and limit number of files to 5
* add back in kafka env setup for dest e2e test
* tidy
Co-authored-by: Roman Isecke <[email protected]>
Co-authored-by: mateuszkuprowski <[email protected]>
Co-authored-by: Mateusz Kuprowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>

* update changelog
* tidy
Update local fork
Would this be a replacement of the existing vector generated by the embedder step, or does Pinecone take in two different vectors to support hybrid search? I noticed in the PR that the embedding is now done inline with the upload, which we want to avoid. If needed, this might require a new embedder to be added and used as part of the pipeline.
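For context on the reviewer's question: Pinecone's hybrid search does not replace the dense vector, a single record carries both a dense embedding (`values`) and a sparse vector (`sparse_values`), so the embedder step's output can stay as-is and the sparse encoding is added alongside it. Below is a minimal sketch of that record shape; the function name and the toy term-frequency sparse encoding (a stand-in for BM25/SPLADE) are illustrative assumptions, not the PR's code.

```python
# Sketch: a Pinecone hybrid-search record carries BOTH a dense vector
# ("values") and a sparse vector ("sparse_values") -- hybrid support
# augments, rather than replaces, the embedder step's output.
from collections import Counter


def build_hybrid_record(record_id, dense_vector, text, vocab):
    """Combine a dense embedding with a toy term-frequency sparse encoding.

    `vocab` maps tokens to sparse-dimension indices; real pipelines would
    use a BM25/SPLADE-style encoder instead (hypothetical helper).
    """
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    indices = sorted(vocab[tok] for tok in counts)
    values = [float(counts[tok]) for tok in sorted(counts, key=lambda t: vocab[t])]
    return {
        "id": record_id,
        "values": dense_vector,  # dense embedding from the embedder step
        "sparse_values": {       # sparse vector for keyword-style matching
            "indices": indices,
            "values": values,
        },
    }


vocab = {"hybrid": 0, "search": 1, "pinecone": 2}
record = build_hybrid_record("chunk-0", [0.1, 0.2, 0.3], "Hybrid search in Pinecone", vocab)
```

A dict of this shape is what Pinecone's upsert accepts per record, which suggests the reviewer's preferred design (keep the embedder as a separate pipeline step and only attach `sparse_values` at upload time) is workable.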
This PR contains logic to use hybrid search in the Pinecone connector.
This solves #199.
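On the PR's other theme, extra-metadata fields for chunks: Pinecone metadata values are limited to strings, numbers, booleans, and lists of strings, so nested chunk metadata (such as the `orig_elements` field mentioned in the commits) typically has to be flattened or JSON-serialized before upload. A minimal sketch of that idea, with a hypothetical helper name and toy data (not the PR's actual code):

```python
# Sketch: flatten chunk metadata into Pinecone-compatible values,
# JSON-encoding anything nested (e.g. orig_elements) -- hypothetical helper.
import json

ALLOWED = (str, int, float, bool)


def flatten_chunk_metadata(element):
    """Keep scalar/list-of-string values as-is; JSON-encode the rest."""
    flat = {}
    for key, value in element.get("metadata", {}).items():
        if isinstance(value, ALLOWED):
            flat[key] = value
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            flat[key] = value
        else:
            flat[key] = json.dumps(value)  # nested structures become JSON strings
    return flat


chunk = {
    "metadata": {
        "filename": "report.pdf",
        "page_number": 3,
        "orig_elements": [{"type": "Title"}],
    }
}
meta = flatten_chunk_metadata(chunk)
```

The consumer side would then `json.loads` the serialized fields to recover the original elements, matching the "add json load to reading orig elements" commit above.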