Multiple Vector databases

I believe that most organizations will have multiple vector databases holding their embeddings. When you want an answer, you may need to query them all, which can be expensive and time-consuming. A reference corpus, which is another vector database with embeddings, can help.

The vector databases

For this example we are using indexes in Azure AI Search to simulate vector databases. In the real world there would be many different technologies for the vector databases. In the folder sample_docs there are two PDF documents: Mac.pdf contains the wiki page on the history of Apple Mac computers, and pdf-sample.pdf contains a short definition of what PDF files are. In the code we create one index for each file:

# Let's create two indexes
process_file("../sample_docs/pdf-sample.pdf", "pdf-sample")
process_file("../sample_docs/Mac.pdf", "mac")

The names of the indexes are pdf-sample and mac.

How does it work

The first time you ask a question, you have to query all vector databases looking for an answer. After running that query, your reference corpus is updated with a record of the form: the answer for question X is in vector store Y. This record is also stored as an embedding. This can be seen in the logs:

  -- Asking the reference corpus first -- 
    -- The corpus does not know which vector store has information about:  Describe characteristics of an iMac G3? -- 
    -- Querying all vector stores -- 
  -- All vector stores will be queried, now Querying vector store: pdf-sample
    --  Vector store pdf-sample does not have the answer for the question: Describe characteristics of an iMac G3?
All vector stores will be queried, now Querying vector store: mac
    -- The corpus has been updated: mac has knowledge about the answer for the question: Describe characteristics of an iMac G3? --

You can see that the corpus now "knows" that the answer for 'Describe characteristics of an iMac G3' is located in the vector database mac.

The next time you ask the same question, or something close to it, the corpus will know where to find the answer and will query only that vector database. This can be seen in the logs:

-- Second iteration -- 
  -- Asking the reference corpus first -- 
    -- The corpus knows which vector store has information about:  Describe characteristics of an iMac G3? -- 
    -- Querying specific vector store: mac
The answer is: The iMac G3, introduced by Apple in 1998, was a significant product that helped .........
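
The routing described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the names ReferenceCorpus and route_query, the 0.8 similarity threshold, and the word-count "embedding" are all assumptions made so the example runs without any Azure service. A real setup would use an embedding model such as text-embedding-ada-002 and a vector index for the corpus. Unlike the logs above, the sketch stops at the first store that answers.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(text.lower().replace("?", " ").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ReferenceCorpus:
    """Remembers which vector store answered which question (hypothetical class)."""

    def __init__(self, threshold=0.8):
        self.entries = []  # list of (question embedding, store name)
        self.threshold = threshold

    def lookup(self, question):
        """Return the store of the most similar known question, or None."""
        q = embed(question)
        best_store, best_score = None, 0.0
        for emb, store in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_store, best_score = store, score
        return best_store if best_score >= self.threshold else None

    def remember(self, question, store):
        self.entries.append((embed(question), store))


def route_query(question, stores, corpus):
    """Ask the corpus first; fall back to trying the stores one by one.

    `stores` maps a store name to a callable returning an answer or None.
    Returns (store name, answer, list of stores that were queried).
    """
    store = corpus.lookup(question)
    if store is not None:
        return store, stores[store](question), [store]
    queried = []
    for name, search in stores.items():
        queried.append(name)
        answer = search(question)
        if answer is not None:
            corpus.remember(question, name)  # update the reference corpus
            return name, answer, queried
    return None, None, queried


# Stand-ins for the two indexes; the answer text is a placeholder.
stores = {
    "pdf-sample": lambda q: None,
    "mac": lambda q: "answer from mac" if "imac" in q.lower() else None,
}
corpus = ReferenceCorpus()
first = route_query("Describe characteristics of an iMac G3?", stores, corpus)
second = route_query("Describe the characteristics of an iMac G3", stores, corpus)
print(first[2])   # ['pdf-sample', 'mac']
print(second[2])  # ['mac']
```

The second question is worded slightly differently, yet it is close enough to the stored one to be routed straight to mac. The threshold controls how close "something close to it" has to be: set it too low and unrelated questions get routed to the wrong store, too high and every rephrasing triggers a full fan-out again.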

What you need to run

You need the following Azure Services deployed, and their keys:

AI Search
Document Search
OpenAI
- Model text-embedding-ada-002 deployed
- Model gpt-4-32k deployed (older models may work but have not been tested)

How to run

Start GitHub Codespaces
Rename file sampleenv.txt to .env
- Update the URLs and keys of your azure services
Open the folder Notebooks, and there open Notebook corpus.ipynb
On the upper right press "Select Kernel"
- Choose Python environments
  - Choose Python 3.1xx.x
Press the button "Run All"
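
Before pressing "Run All", it can help to confirm that every setting in .env has been filled in. The sketch below is an optional helper, not part of the repository: it assumes the file uses plain KEY=VALUE lines, and the function names read_env and missing_settings are made up for illustration.

```python
from pathlib import Path


def read_env(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values):
    """Return the keys whose value is still empty, in sorted order."""
    return sorted(key for key, value in values.items() if not value)


# Example use from the repository root:
#   print(missing_settings(read_env(".env")))
```

An empty list means every key has a value. It does not verify that the URLs and keys are actually valid for your Azure services.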

Experimenting

After the first execution you are free to run the last cell of the Notebook that starts with:

# Demo the a query with the corpus and without the corpus

You can change the question in that cell:

query_with_corpus("Describe characteristics of an iMac G3?")

If you run the whole notebook multiple times

There is no logic yet to drop or update the indexes. If you see erratic behaviour, go to your Azure AI Search and drop the indexes manually.
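
As an alternative to the portal, the indexes could be dropped from Python. The following is a sketch under assumptions rather than code from the notebook: it uses SearchIndexClient from the azure-search-documents package, the environment variable names AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_KEY are hypothetical (use whatever names your .env defines), and the helper names make_client and drop_indexes are invented here.

```python
import os


def make_client():
    """Build a SearchIndexClient; needs the azure-search-documents package."""
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient

    # Hypothetical variable names: adjust them to match your .env file.
    endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
    key = os.environ["AZURE_SEARCH_KEY"]
    return SearchIndexClient(endpoint, AzureKeyCredential(key))


def drop_indexes(client, names):
    """Delete the named indexes that exist; return the names actually dropped."""
    existing = set(client.list_index_names())
    dropped = []
    for name in names:
        if name in existing:  # skip indexes that were never created
            client.delete_index(name)
            dropped.append(name)
    return dropped


# Example use, once the environment variables are set:
#   drop_indexes(make_client(), ["pdf-sample", "mac"])
```

If the reference corpus is stored in an index of its own, that index would need to be added to the list as well. Its name is not given here, so check your Azure AI Search resource for it.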

MiguelElGallo/ragvectordb
