kslohith/gtsearch

gtsearch

GTSearch is a domain-specific search engine for questions about Georgia Tech. Its data comes from a domain-specific web crawler built on the Scrapy framework. Relevance is handled by vector similarity search: Pinecone, used as the vector database, returns the top k documents most similar to a query, and those documents are passed as context to the OpenAI API to generate the answer.
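The retrieve-then-answer flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the vector database has already returned (score, text) pairs, and the function name `build_prompt`, the prompt wording, and the sample passages are all hypothetical placeholders.

```python
from typing import List, Tuple


def build_prompt(question: str, matches: List[Tuple[float, str]], k: int = 3) -> str:
    """Keep the k highest-scoring passages and format them as context.

    `matches` stands in for the results of a vector database query;
    the real project may structure its results and prompt differently.
    """
    # Sort by similarity score, highest first, and keep the top k.
    top = sorted(matches, key=lambda m: m[0], reverse=True)[:k]
    context = "\n\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(top, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder passages and scores, invented purely for illustration.
matches = [
    (0.62, "placeholder passage B"),
    (0.91, "placeholder passage A"),
    (0.15, "placeholder passage C"),
]
prompt = build_prompt("placeholder question?", matches, k=2)
print(prompt)
```

The resulting string is what would be sent to the language model as the user message; limiting the context to the top k passages keeps the prompt short and focused on the most similar documents.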

System design

System design for the crawling module

System design for the RAG module

Running Instructions

Install Scrapy using pip:

pip install scrapy

To run a crawl and insert the relevant documents into Pinecone:

scrapy crawl tsearch -o search.json

File organisation

Spiders

The 'spiders' folder contains the primary GTSearch spider along with the other middleware required to operate the web crawler. Custom logic compares each crawled page's text with a base text using vector similarity search, powered by FastEmbed. Documents found to be relevant are then pushed into Pinecone, which serves as the vector database.
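As a rough illustration of the relevance check described above, the sketch below compares a page embedding with a base embedding using cosine similarity. It assumes the embeddings have already been computed (for example by an embedding library); the function names, the 0.5 threshold, and the toy vectors are hypothetical and not taken from the project.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as not similar.
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_vec: Sequence[float], base_vec: Sequence[float],
                threshold: float = 0.5) -> bool:
    """Keep a crawled page only if it is close enough to the base text.

    The threshold is an assumed value chosen for illustration.
    """
    return cosine_similarity(page_vec, base_vec) >= threshold


# Toy three-dimensional vectors; real embeddings have hundreds of dimensions.
base = [1.0, 0.0, 1.0]
on_topic = [0.9, 0.1, 0.8]
off_topic = [0.0, 1.0, 0.0]

print(is_relevant(on_topic, base))
print(is_relevant(off_topic, base))
```

Only pages that pass a check of this kind would be pushed to the vector database, which keeps unrelated pages out of the index.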

Server

The 'server' folder houses the Flask web server that hosts the search engine on the web. To run the server:

python app.py

The endpoint '/tsearch/search' is a POST endpoint that takes a user query, retrieves the top k documents relevant to that query from Pinecone, and passes them as context to the OpenAI API to produce the answer.
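A client call to this endpoint might look like the sketch below. The README does not document the request schema, so the JSON field name `query`, the JSON content type, and the base URL `http://localhost:5000` (Flask's default development port) are assumptions; check app.py for the actual values. The helper `build_search_request` is a hypothetical name.

```python
import json
import urllib.request


def build_search_request(query: str,
                         base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build (but do not send) a POST request for the search endpoint.

    The body shape {"query": ...} is an assumption, not a documented contract.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/tsearch/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("placeholder question?")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before the server is running.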
