This repository contains two homework notebooks from the NLP Practice course on LLMs by BigData Team. Each notebook demonstrates an application of Large Language Models (LLMs) to a different NLP task: text classification, and question answering using retrieval-augmented generation (RAG).
Task: Text Classification using Transformers
- Implements an end-to-end NLP workflow using `distilbert-base-uncased` by default for text classification tasks
- Features custom dataset handling with tokenization and batching using PyTorch
- Includes a comprehensive `ModelTrainer` class for loading datasets, training, validation, metrics calculation (`f1_score`, `precision`, `recall`), and model saving (see the sketch after this list)
- Uses `wandb` for logging and experiment tracking
- Offers multi-GPU support with data parallelism
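A minimal sketch of that workflow, assuming a small list of text/label pairs. The `TextDataset` class, the toy data, and all hyperparameters here are illustrative placeholders, not the notebook's actual `ModelTrainer` API; only the model name and the metrics come from the description above.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import f1_score, precision_score, recall_score

class TextDataset(Dataset):
    """Tokenizes all texts up front and serves (input, label) tensors."""
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.enc.items()}
        item["labels"] = self.labels[idx]
        return item

train_texts = ["loved this film", "great acting", "dull plot", "waste of time"]
train_labels = [1, 1, 0, 0]  # toy binary sentiment labels

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:       # data parallelism across GPUs
    model = torch.nn.DataParallel(model)
model.to(device)

loader = DataLoader(TextDataset(train_texts, train_labels, tokenizer),
                    batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:                    # one epoch for brevity
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss.mean()   # .mean() also covers DataParallel
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # wandb.log({"train/loss": loss.item()})  # tracking hook, as in the notebook

# Metrics on the training set, for illustration only
model.eval()
preds, gold = [], []
with torch.no_grad():
    for batch in loader:
        labels = batch.pop("labels")
        logits = model(**{k: v.to(device) for k, v in batch.items()}).logits
        preds.extend(logits.argmax(-1).cpu().tolist())
        gold.extend(labels.tolist())
print("f1:", f1_score(gold, preds),
      "precision:", precision_score(gold, preds),
      "recall:", recall_score(gold, preds))
```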
This notebook serves as a strong baseline for fine-tuning transformer models for classification tasks and can be easily adapted for other datasets or models.
Task: Question Answering using Retrieval-Augmented Generation (RAG)
- Demonstrates question answering with a Large Language Model (LLM), both with and without retrieval-augmented generation (RAG)
- Implements both plain LLM chain responses and RAG-based methods using the `google/flan-t5-large` model
- Uses FAISS vector-based retrieval over supporting documents, with embeddings generated by `sentence-transformers/all-MiniLM-L6-v2`
- Compares several configurations: plain LLM responses, RAG with source tracking (`RetrievalQA`), and RAG with detailed source chains (`RetrievalQAWithSourcesChain`); see the sketch after this list
- Uses external data (`data/cats_content.txt`) to support enhanced question-answering performance
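A minimal sketch of the three configurations, using the classic LangChain API. The chunking parameters, `chain_type="stuff"`, and the example question are assumptions; the model names, the data path, and the chain classes come from the description above.

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

# Load and chunk the supporting document, then index it with FAISS
docs = TextLoader("data/cats_content.txt").load()
chunks = CharacterTextSplitter(chunk_size=500,
                               chunk_overlap=50).split_documents(docs)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_documents(chunks, embeddings)

# Wrap flan-t5-large as a LangChain LLM
llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large", task="text2text-generation")

question = "What do cats eat?"

# 1. Plain LLM response, no retrieval
print(llm(question))

# 2. RAG with source tracking
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff",
    retriever=index.as_retriever(), return_source_documents=True)
print(qa({"query": question}))

# 3. RAG with detailed source chains
qa_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=index.as_retriever())
print(qa_sources({"question": question}))
```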
This notebook provides a comprehensive exploration of how RAG can be used to improve the accuracy and reliability of LLM-based question answering.
The templates and resources were taken from the original course repository: big-data-team/nlp-course
Certificate of course completion with honors: Daniil Bogdanov