NLP with LLMs: Text Classification using Transformers and Retrieval-Augmented Question Answering

This repository contains two homework notebooks from the NLP Practice on LLMs course by BigData Team. Each notebook demonstrates a different application of Large Language Models (LLMs): text classification and question answering with retrieval-augmented generation (RAG).

Notebooks Overview

Task: Text Classification using Transformers

  • Implements an end-to-end text classification workflow using distilbert-base-uncased by default
  • Features custom dataset handling with tokenization and batching in PyTorch
  • Includes a comprehensive ModelTrainer class for loading datasets, training, validation, metric calculation (F1 score, precision, recall), and model saving
  • Uses wandb for logging and experiment tracking
  • Offers multi-GPU support with data parallelism

This notebook serves as a strong baseline for fine-tuning transformer models for classification tasks and can be easily adapted for other datasets or models.
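A minimal sketch of this setup, using Hugging Face Transformers with a plain PyTorch training loop (the dataset class, example data, and hyperparameters below are illustrative, not the notebook's actual ModelTrainer API):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased"

class TextClassificationDataset(Dataset):
    """Tokenizes raw texts once and serves (input_ids, attention_mask, labels) items."""
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding="max_length",
                                   max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy data in place of the real dataset
train_loader = DataLoader(
    TextClassificationDataset(["great movie", "terrible plot"], [1, 0], tokenizer),
    batch_size=16, shuffle=True,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for batch in train_loader:
    optimizer.zero_grad()
    outputs = model(**batch)   # returns a loss when `labels` are provided
    outputs.loss.backward()
    optimizer.step()
```

The same skeleton extends naturally to validation, metric logging with wandb, and wrapping the model in torch.nn.DataParallel for multi-GPU training, as the notebook does.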

Task: Question Answering using Retrieval-Augmented Generation (RAG)

  • Demonstrates an experiment using a Large Language Model (LLM) for question answering, with and without retrieval-augmented generation (RAG)
  • Implements both plain LLM chain responses and RAG-based methods using the google/flan-t5-large model
  • Uses FAISS vector retrieval over supporting documents, with embeddings generated by sentence-transformers/all-MiniLM-L6-v2
  • Compares several configurations: plain LLM responses, RAG with source tracking (RetrievalQA), and RAG with detailed source chains (RetrievalQAWithSourcesChain)
  • Uses an external document (data/cats_content.txt) as the retrieval corpus for answering questions

This notebook provides a comprehensive exploration of how RAG can be used to improve the accuracy and reliability of LLM-based question answering.
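A minimal sketch of such a RAG pipeline, assuming the classic LangChain API (module paths and chain interfaces vary between LangChain versions, and the notebook's own chain setup may differ):

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from transformers import pipeline

# Split the support document into chunks and index them with FAISS
raw_text = open("data/cats_content.txt").read()
chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(raw_text)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(chunks, embeddings)

# Wrap flan-t5-large as a LangChain LLM
llm = HuggingFacePipeline(pipeline=pipeline(
    "text2text-generation", model="google/flan-t5-large", max_new_tokens=128))

# RAG chain: retrieve relevant chunks, answer, and return the retrieved sources
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
print(qa({"query": "How long do cats usually live?"})["result"])
```

For per-answer source attribution, RetrievalQAWithSourcesChain can be used in place of RetrievalQA, as the notebook also demonstrates.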


The templates and resources were taken from the original course repository: big-data-team/nlp-course

Certificate of course completion with honors: Daniil Bogdanov