LearningLion-WOO

The project is a study on the use of generative AI to improve the services of SSC-ICT by supporting employees and optimizing internal processes. The original focus is on generative large language models (LLMs) in the form of Retrieval Augmented Generation (RAG), because they can have the most significant impact on the daily work of SSC-ICT employees. This version dives deeper into the retrieval part of RAG. The original version can be found here.

This version is part of the master's thesis of Nicky Ju.

The paper corresponding to this repository can be found in the TU Delft Repository.

Flow Chart


Files

The filename prefix indicates what each script does:

  • create --> creates evaluation files with specific preprocessing
  • evaluate --> runs queries on the vector database/corpus
  • ingest --> creates the vector database/corpus
  • preprocess --> preprocesses the data in different ways before the database is created
  • relevance --> (re-)evaluates the results

Complete Example Pipeline

This guide assumes that you are familiar with the basics of Python, such as setting up an environment and installing packages.

  1. First steps
  2. Preprocess Data
    • Run preprocess_real_words.py or preprocess_stem_stopwords.py to preprocess the data in different ways.
  3. Database creation
    • Create Vector Store with ingest_embeddings.py.
    • Create BM25 Corpus with ingest_bm25.py.
  4. Evaluation
    • Run evaluate_embeddings.py against the vector store or evaluate_bm25.py against the BM25 corpus.
  5. Evaluation metrics
    • Use relevance_evaluation.ipynb to calculate basic metrics such as precision and recall.
    • Use relevance_dossier_average.ipynb for frequency-based evaluation and relevance_dossier_MAP.ipynb for weighted frequency-based evaluation.
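
To make the basic metrics in step 5 concrete, here is a minimal sketch of precision and recall for a single query. It is not taken from relevance_evaluation.ipynb, which may compute them differently. The function name `precision_recall_at_k` and the document IDs are hypothetical, and the sketch assumes each retrieved ID appears only once.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one query.

    retrieved: ranked document IDs returned by the retriever (assumed unique).
    relevant:  set of document IDs judged relevant for the query.
    """
    top_k = list(retrieved)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Precision: share of the returned documents that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of all relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs, for illustration only.
retrieved_ids = ["doc3", "doc7", "doc1", "doc9"]
relevant_ids = {"doc1", "doc3", "doc5"}
precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=4)
print(f"precision@4={precision:.2f} recall@4={recall:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents are found. Averaging these values over all evaluation queries gives one overall score per retrieval method.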