In this module, we will learn how to evaluate and monitor our LLM and RAG system.
In the evaluation part, we assess the quality of our entire RAG system before it goes live.
In the monitoring part, we collect, store and visualize metrics to assess the answer quality of a deployed LLM. We also collect chat history and user feedback.
- Why monitor LLM systems?
- Monitoring answer quality of LLMs
- Monitoring answer quality with user feedback
- What else to monitor that is not covered in this module?
- Modules recap
- Online vs offline evaluation
- Offline evaluation metrics
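To make "offline evaluation metrics" concrete, here is a minimal sketch of two common offline retrieval metrics, hit rate and MRR. It assumes `relevance_total` is a list with one entry per query, where each entry is a list of booleans marking whether each returned result is the relevant document; the function names are illustrative, not taken from the notebooks.

```python
# Minimal sketch of offline retrieval metrics (hit rate and MRR).
# relevance_total: list of per-query relevance flags, e.g. [[False, True, False], ...]

def hit_rate(relevance_total):
    # fraction of queries where the relevant document appears anywhere in the results
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def mrr(relevance_total):
    # mean reciprocal rank: 1 / position of the first relevant result, averaged over queries
    total = 0.0
    for line in relevance_total:
        for rank, relevant in enumerate(line):
            if relevant:
                total += 1 / (rank + 1)
                break
    return total / len(relevance_total)
```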
Links:
- notebook
- results-gpt4o.csv (answers from GPT-4o)
- results-gpt35.csv (answers from GPT-3.5-Turbo)
Content
- A->Q->A' cosine similarity
- Evaluating gpt-4o
- Evaluating gpt-3.5-turbo
- Evaluating gpt-4o-mini
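A minimal sketch of the A->Q->A' cosine similarity idea listed above: embed the original answer (A) and the LLM-generated answer (A') and compare the two vectors. The model name and the record keys are illustrative assumptions, not necessarily what the notebook uses.

```python
# Sketch: cosine similarity between the original answer (A) and the generated answer (A').
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # example embedding model

def compute_similarity(record):
    answer_orig = record["answer_orig"]  # A: original answer (assumed column name)
    answer_llm = record["answer_llm"]    # A': LLM-generated answer (assumed column name)
    v_orig = model.encode(answer_orig)
    v_llm = model.encode(answer_llm)
    # cosine similarity = dot product of the L2-normalised vectors
    return float(np.dot(v_orig, v_llm) / (np.linalg.norm(v_orig) * np.linalg.norm(v_llm)))
```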
Links:
- notebook
- results-gpt4o-cosine.csv (answers from GPT-4o with cosine similarity)
- results-gpt35-cosine.csv (answers from GPT-3.5-Turbo with cosine similarity)
- results-gpt4o-mini.csv (answers from GPT-4o-mini)
- results-gpt4o-mini-cosine.csv (answers from GPT-4o-mini with cosine similarity)
- LLM as a judge
- A->Q->A' evaluation
- Q->A evaluation
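Below is a minimal sketch of LLM-as-a-judge for the Q->A direction: ask a model to rate whether the generated answer is relevant to the question. The prompt wording, the relevance labels, and the model name are illustrative assumptions rather than the exact prompt from the notebook.

```python
# Sketch: using an LLM as a judge for Q->A relevance.
import json
from openai import OpenAI

client = OpenAI()

prompt_template = """
You evaluate a RAG system. Given a question and a generated answer, classify the
answer as "RELEVANT", "PARTLY_RELEVANT" or "NON_RELEVANT" and explain briefly.

Question: {question}
Generated answer: {answer_llm}

Reply in JSON with the keys "Relevance" and "Explanation".
""".strip()

def judge(question, answer_llm, model="gpt-4o-mini"):
    prompt = prompt_template.format(question=question, answer_llm=answer_llm)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # assumes the model follows the instruction to reply with JSON
    return json.loads(response.choices[0].message.content)
```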
Links:
- notebook
- evaluations-aqa.csv (A->Q->A' evaluation results)
- evaluations-qa.csv (Q->A evaluation results)
You can see the prompts and the output from Claude here.
Content
- Adding +1 and -1 buttons
- Setting up a PostgreSQL database
- Putting everything in Docker Compose
```bash
pip install pgcli
pgcli -h localhost -U your_username -d course_assistant -W
```
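As a rough illustration of the "+1 and -1 buttons" and PostgreSQL pieces, here is a minimal sketch that writes one feedback record with psycopg2. The table and column names and the connection settings are assumptions for illustration; only the database name matches the pgcli command above.

```python
# Sketch: storing +1 / -1 user feedback in Postgres.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="course_assistant",
    user="your_username",      # assumption
    password="your_password",  # assumption
)

def save_feedback(conversation_id: str, feedback: int) -> None:
    """feedback is +1 (thumbs up) or -1 (thumbs down)."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO feedback (conversation_id, feedback, timestamp) "
            "VALUES (%s, %s, NOW())",  # assumed table/columns
            (conversation_id, feedback),
        )
    conn.commit()
```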
Links:
- adding vector search
- adding OpenAI
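For the "adding vector search" step, a minimal sketch of a knn query against Elasticsearch 8.x is shown below; the retrieved documents would then be passed to the OpenAI call in the next step. The index, field, and model names are assumptions, not necessarily those used in the app.

```python
# Sketch: vector (knn) search against Elasticsearch 8.x.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # example embedding model

def vector_search(query: str, index: str = "course-questions", k: int = 5):
    query_vector = model.encode(query).tolist()
    response = es.search(
        index=index,
        knn={
            "field": "question_text_vector",  # assumed dense_vector field name
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
        },
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```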
Links:
- Setting up Grafana
- Tokens and costs
- QA relevance
- User feedback
- Other metrics
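To illustrate the "Tokens and costs" metric, here is a minimal sketch that counts tokens with tiktoken and estimates a per-request cost. The prices are placeholder assumptions, not current OpenAI pricing, and it requires a tiktoken version that recognizes the model name.

```python
# Sketch: token counting and cost estimation for one request.
import tiktoken

def estimate_cost(prompt: str, completion: str, model: str = "gpt-4o") -> dict:
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))
    completion_tokens = len(enc.encode(completion))
    # placeholder prices per 1000 tokens -- check the provider's pricing page
    price_in, price_out = 0.005, 0.015
    cost = (prompt_tokens * price_in + completion_tokens * price_out) / 1000
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": cost,
    }
```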
Links:
- Grafana variables
- Exporting and importing dashboards
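One way to export a dashboard programmatically is the Grafana HTTP API; the sketch below fetches a dashboard's JSON by its UID, assuming Grafana runs locally and `GRAFANA_TOKEN` holds a service-account or API token. Dashboards can also be exported and imported directly from the Grafana UI, as covered in the video.

```python
# Sketch: exporting a Grafana dashboard as JSON via the HTTP API.
import json
import os
import requests

GRAFANA_URL = "http://localhost:3000"
headers = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

def export_dashboard(uid: str, path: str) -> None:
    resp = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/{uid}", headers=headers)
    resp.raise_for_status()
    with open(path, "w") as f:
        json.dump(resp.json()["dashboard"], f, indent=2)
```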
Links:
See here
https://www.loom.com/share/1dd375ec4b0d458fabdfc2b841089031
- Notes by slavaheroes
- Did you take notes? Add them above this line (Send a PR with links to your notes)