
# Document AMA

Builds upon kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference, adding a Streamlit front-end. It lets you load any .txt file or .pdf document containing text and ask questions about it using LLMs on CPU.

## Setting up

### Docker

Build the Docker image:

```bash
docker build -t document-ama .
```

Run it:

```bash
docker run -d -p 8501:8501 --name document-ama document-ama:latest
```

Open the Streamlit app in your browser:

http://localhost:8501/

### Local env

Initialize a virtual environment (venv), if desired:

```bash
python3 -m venv venv
source venv/bin/activate
```

Install requirements:

```bash
pip install -r requirements.txt
```

Run Streamlit:

```bash
streamlit run app.py
```

A new browser tab will open with the Streamlit app.

## Architecture and app.py code flow

(See also the Medium article for the original code architecture.)

It is currently set up to use the 8-bit quantized GGML build of Llama 2 as the Q&A model and sentence-transformers/all-MiniLM-L6-v2 as the embeddings model. If those models are not present (e.g. on the first run), they are downloaded first.
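
The exact files the app pins live in the code, but as an illustration, a first-run download via `huggingface_hub` might look like this (the repo id and filename below are assumptions, not necessarily what `app.py` uses):

```python
from huggingface_hub import hf_hub_download

# Assumed source for the 8-bit quantized GGML weights; app.py may pin a different repo/file.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q8_0.bin",
)
# The sentence-transformers embeddings model is fetched automatically the first
# time HuggingFaceEmbeddings is instantiated (see the indexing sketch below).
```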

It then asks for a file to be uploaded. The file needs to be either a .txt file or a .pdf with selectable text; the app will not try to OCR the document.
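
As an illustration (the widget label and variable names are hypothetical, not copied from app.py), the upload step with Streamlit looks roughly like:

```python
import streamlit as st

# Restrict uploads to the supported formats described above.
uploaded_file = st.file_uploader("Upload a .txt or .pdf document", type=["txt", "pdf"])
if uploaded_file is not None:
    raw_bytes = uploaded_file.read()  # raw bytes, later hashed and written to disk for processing
```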

Once the document is uploaded, Langchain is used to (a sketch follows the list):

  1. Load the document;
  2. Extract the text with PyPDF;
  3. Split it into 500-character chunks (with 50-character overlap);
  4. Compute the embedding vector of each chunk;
  5. Finally, load those embeddings, together with their respective chunks, into a FAISS database.
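
A minimal sketch of that pipeline, assuming the classic Langchain APIs (paths and variable names are placeholders; the chunk sizes and embeddings model mirror the description above):

```python
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1-2. Load the uploaded document and extract its text with PyPDF.
docs = PyPDFLoader("uploaded.pdf").load()

# 3. Split into 500-character chunks with 50-character overlap.
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 4. Embed each chunk with all-MiniLM-L6-v2.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 5. Build the FAISS index and persist it for reuse.
db = FAISS.from_documents(chunks, embeddings)
db.save_local("vectorstore/<file-checksum>")
```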

The FAISS files are saved on disk under a folder named after the file's checksum, so if the exact same file is uploaded again the previously created database is reused instead of being rebuilt.
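
A sketch of that caching logic, continuing from the indexing sketch above (the hash algorithm and folder layout are assumptions; the app may compute the checksum differently):

```python
import hashlib
import os

from langchain.vectorstores import FAISS

def index_path(file_bytes: bytes) -> str:
    # Folder named after the file's checksum (sha256 here is an assumption).
    return os.path.join("vectorstore", hashlib.sha256(file_bytes).hexdigest())

path = index_path(raw_bytes)
if os.path.isdir(path):
    db = FAISS.load_local(path, embeddings)        # same file as before: reuse the index
else:
    db = FAISS.from_documents(chunks, embeddings)  # otherwise build and cache it
    db.save_local(path)
```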

After that is done, the app asks for the question. It then loads the LLM into memory using CTransformers and keeps using Langchain to (a sketch follows the list):

  1. Load the FAISS database into memory;
  2. Build a prompt from the question and the hardcoded prompt template string;
  3. Build a RetrievalQA chain with the LLM, which loads the context relevant to the question into the prompt template;
  4. Ask the RetrievalQA chain the question.
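
Under the same assumptions (classic Langchain APIs; the prompt string and file paths below are placeholders, not the app's hardcoded values), the Q&A step might look like:

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import CTransformers
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS

# 1. Load the FAISS db into memory (path is a placeholder).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.load_local("vectorstore/<file-checksum>", embeddings)

# Load the quantized Llama 2 model via CTransformers (local path is assumed).
llm = CTransformers(model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
                    model_type="llama",
                    config={"max_new_tokens": 256, "temperature": 0.01})

# 2. Prompt template (placeholder text, not the app's hardcoded string).
template = """Use the following context to answer the question.
Context: {context}
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

# 3. RetrievalQA chain that stuffs the retrieved chunks into the prompt.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,   # keep the supporting passages for display
    chain_type_kwargs={"prompt": prompt},
)

# 4. Ask the question.
result = qa({"query": "What is the document about?"})
```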

Once the RetrievalQA chain returns an answer, the app displays the answer together with the passages of the text most relevant to it.
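
Displaying that in Streamlit could then look roughly like this (key names follow the standard RetrievalQA output dict; `result` comes from the sketch above):

```python
import streamlit as st

st.write(result["result"])               # the answer
for doc in result["source_documents"]:   # the passages the answer was drawn from
    st.write(doc.page_content)
```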