This repository contains a system for building and querying a database of document embeddings, designed to handle document storage, retrieval, and question-answering tasks using the Chroma database. The system supports dynamic updates by hashing document content, and provides efficient retrieval through similarity search over embeddings generated by a configurable model (currently BAAI/bge-base-en-v1.5 from HuggingFace).
- Document Ingestion: Load PDF documents, split them into manageable chunks, and store them in a Chroma database with corresponding embeddings and metadata.
- Hash-based Updates: Efficiently track changes in document content via content hashing and only update changed chunks (see the sketch after this list).
- Embeddings with HuggingFace: Leverages `HuggingFaceBgeEmbeddings` to generate document and query embeddings.
- Query Interface: A command-line tool to query the database and retrieve relevant document chunks using a similarity search and an LLM-powered response system.
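One way the hash-based update step can be implemented (a sketch under assumptions, not the repository's exact code: the `chunk_id` helper and the ID format are illustrative):

```python
import hashlib

def chunk_id(source: str, page: int, content: str) -> str:
    """Derive a stable ID for a chunk from its origin plus a hash of its content."""
    # If the chunk text changes, its digest (and therefore its ID) changes,
    # so only new or modified chunks need to be re-embedded and rewritten.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{source}:{page}:{digest}"
```

Chunks whose IDs already exist in the database can then be skipped, and only new or changed IDs are embedded and stored.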
- Clone the repository:

  ```bash
  git clone https://github.com/DrR0bot/DDQS_python.git
  cd DDQS_python
  ```
- Install required dependencies: This project uses Python, so ensure you have it installed. Then, install the dependencies using pip:

  ```bash
  pip install -r requirements.txt
  ```
- Download other requirements: Download the required spaCy package within the venv:

  ```bash
  python -m spacy download en_core_web_sm
  ```
- Install PyTorch with CUDA (Optional): If you want to leverage the GPU (CUDA) for faster embedding generation, install PyTorch with CUDA support.

  Using `pip`:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  ```

  Using `conda`: If you are using Conda, install PyTorch and related libraries with CUDA support:

  ```bash
  conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
  ```
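To verify that the CUDA build was picked up, a quick check (not part of the repository) is:

```python
# Confirm that PyTorch was installed with CUDA support and can see a GPU.
import torch

print(torch.cuda.is_available())  # True means embeddings can run on the GPU
```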
- Populate the database: The `populate_database.py` script loads documents from a directory, splits them into chunks, generates embeddings, and stores them in the Chroma database. If a document is modified, only the updated chunks are added or replaced.

  ```bash
  python populate_database.py [--reset]
  ```

  `--reset`: Optional flag to clear the existing database before populating it with new data.
- Query the database: Once the documents are ingested into the database, you can run queries to retrieve relevant chunks based on their embeddings and generate a response using an LLM.

  ```bash
  python query_data.py "Your query text here"
  ```

  The system retrieves the most similar document chunks for the query text and generates a response using them as context (a sketch of this retrieval step follows).
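The core of the query path might look roughly like the following (a sketch under assumptions: `CHROMA_PATH` is an illustrative placeholder and the LLM call is omitted; see `query_data.py` for the actual logic):

```python
# Sketch of the similarity-search step using LangChain's Chroma wrapper.
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Chroma

CHROMA_PATH = "chroma"  # illustrative; use the directory where your database is persisted

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-base-en-v1.5")
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embeddings)

# Retrieve the top-k chunks most similar to the query text.
results = db.similarity_search_with_score("Your query text here", k=5)

# Join the retrieved chunks into a context block that is handed to the LLM.
context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
print(context)
```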
- `populate_database.py`: Handles loading PDF documents, splitting them into chunks, generating embeddings, and storing/updating them in the Chroma database.
- `get_embedding_function.py`: Contains the logic to select and configure the embedding model (currently using `HuggingFaceBgeEmbeddings`).
- `query_data.py`: Handles querying the database using a similarity search and generating responses with an LLM.
Changing the Embedding Model
By default, the system uses the `HuggingFaceBgeEmbeddings` model (BGE Base) from HuggingFace. You can change the model or switch between CPU and GPU in `get_embedding_function.py`.
To change the device to GPU, modify this line in `get_embedding_function.py`:

```python
model_kwargs = {"device": "cuda"}  # Use CUDA for GPU
```

To use another embedding model, change the `model_name` parameter:

```python
model_name = "BAAI/bge-base-en-v1.5"  # Change to your preferred model
```
In `populate_database.py`, update the `DATA_PATH` variable to point to the directory where your PDF files are stored:

```python
DATA_PATH = "data"  # Change this to the path of your documents
```
- Python 3.8+
- HuggingFace Transformers
- PyTorch (with or without CUDA)
- LangChain Community Packages
This project is licensed under the MIT License.