RAG PDF and Document Retrieval Framework

This repository provides a framework for extracting and retrieving information from PDFs and text documents. It combines OCR with language models to process text-based or image-based PDFs, create semantic embeddings, and retrieve relevant passages using semantic chunking, multi-vector retrieval, and vector similarity search.

Features

1. PDF Text and Image Extraction

Uses MiniCPM-Llama3-V-2_5, a multimodal model available on Hugging Face, for OCR-based text extraction from image-based PDFs.
Extracts text directly from text-based PDFs using the PyPDF2 library.

Model Reference: MiniCPM-Llama3-V-2_5 on Hugging Face
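
The choice between the two extraction paths can be sketched as below. This is a minimal illustration, not the repository's actual code: the `needs_ocr` heuristic, its `min_chars` threshold, and the `ocr_page` callback are assumptions introduced here, and only `PdfReader`, `pages`, and `extract_text` come from PyPDF2.

```python
from typing import Callable, List, Optional


def needs_ocr(page_text: Optional[str], min_chars: int = 20) -> bool:
    """Return True when a page yields too little text to count as text-based.

    The 20-character threshold is an arbitrary assumption for illustration.
    """
    return len((page_text or "").strip()) < min_chars


def extract_pdf_text(
    path: str, ocr_page: Optional[Callable[[int], str]] = None
) -> List[str]:
    """Extract text page by page, falling back to OCR for image-like pages."""
    # Imported lazily so that needs_ocr can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    texts = []
    for index, page in enumerate(PdfReader(path).pages):
        text = page.extract_text() or ""
        if needs_ocr(text) and ocr_page is not None:
            # Hypothetical hook: for example, a call into a MiniCPM-based OCR routine.
            text = ocr_page(index)
        texts.append(text)
    return texts
```

Passing the OCR step in as a callback keeps the routing logic independent of how the vision model is loaded.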

2. RAG (Retrieval-Augmented Generation) Workflow

Built on LangChain, a framework for building applications on top of language models, including retrieval pipelines.
Uses models served through Ollama for both chat and embeddings, covering natural language interaction and retrieval.

LangChain Reference: LangChain Documentation
Ollama installation guide: Ollama Installation and Usage Guide

Ollama available models: Ollama Installation and Usage Guide
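
The retrieve, augment, and generate steps can be sketched without any particular library. The example below is an assumption-laden illustration rather than the repository's implementation: `build_prompt`, `answer`, and the prompt wording are made up here, and `retrieve` and `generate` stand in for a vector-store lookup and an Ollama chat call.

```python
from typing import Callable, List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Augment the question with numbered context chunks."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # placeholder for a vector-store search
    generate: Callable[[str], str],             # placeholder for a chat model call
    k: int = 3,
) -> str:
    """Retrieve k chunks, build the augmented prompt, and generate an answer."""
    chunks = retrieve(question, k)
    return generate(build_prompt(question, chunks))
```

Because both steps are plain callables, the same function works with a stub during testing and with real retrievers and models later.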

3. Multi-Vector and Semantic Retrieval

Supports semantic chunking of text and multi-vector retrieval to improve retrieval accuracy.
Uses Chroma and FAISS for vector storage and similarity search.
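
The idea behind multi-vector retrieval, several small vectors per document that each point back to their parent, can be illustrated with a toy example. Everything here is a simplified assumption: the bag-of-words `embed` function stands in for a real embedding model, naive sentence splitting stands in for semantic chunking, and a Python list stands in for Chroma or FAISS.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs: Dict[str, str]) -> List[Tuple[Counter, str]]:
    """Store one vector per sentence, each tagged with its parent document id."""
    index = []
    for doc_id, text in docs.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                index.append((embed(sentence), doc_id))
    return index


def search(index: List[Tuple[Counter, str]], query: str, k: int = 2) -> List[str]:
    """Rank sentence vectors by similarity and return up to k unique parent ids."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    parents: List[str] = []
    for _, doc_id in ranked:
        if doc_id not in parents:
            parents.append(doc_id)
        if len(parents) == k:
            break
    return parents
```

Matching on small chunks while returning the parent document lets a short, specific query hit a single sentence and still retrieve its full surrounding context.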

4. Query Answering

main.py lets users pass a question (query) together with the path to a PDF document and returns an answer based on the document's content.
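
A command-line entry point of this kind could be sketched as follows. The flag names `--pdf` and `--question` are hypothetical and chosen only for illustration; the actual interface of main.py may differ, so check the script before relying on them.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a document path and a question (hypothetical flag names)."""
    parser = argparse.ArgumentParser(
        description="Ask a question about a PDF document."
    )
    parser.add_argument("--pdf", required=True, help="Path to the PDF document")
    parser.add_argument(
        "--question", required=True, help="Question to answer from the document"
    )
    return parser.parse_args(argv)
```

Accepting an explicit `argv` list makes the parser easy to exercise in tests without touching `sys.argv`.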

Installation

Clone the repository:

git clone https://github.com/balbakri1/SimpleRAG.git
cd SimpleRAG
