In this module, we will learn how to evaluate and monitor our LLM and RAG system.
In the evaluation part, we assess the quality of our entire RAG system before it goes live.
In the monitoring part, we collect, store and visualize metrics to assess the answer quality of a deployed LLM. We also collect chat history and user feedback.
- Why monitor LLM systems?
- Monitoring answer quality of LLMs
- Monitoring answer quality with user feedback
- What else to monitor that is not covered in this module?
- Modules recap
- Online vs offline evaluation
- Offline evaluation metrics
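To make "offline evaluation metrics" concrete, here is a minimal sketch of two common offline retrieval metrics, hit rate and MRR. It assumes `relevance_total` is a list with one entry per query, where each entry is a list of booleans marking whether each returned result is the relevant document; the function names are illustrative, not taken from the notebooks.

```python
# Minimal sketch of offline retrieval metrics (hit rate and MRR).
# relevance_total: list of per-query relevance flags, e.g. [[False, True, False], ...]

def hit_rate(relevance_total):
    # fraction of queries where the relevant document appears anywhere in the results
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def mrr(relevance_total):
    # mean reciprocal rank: 1 / position of the first relevant result, averaged over queries
    total = 0.0
    for line in relevance_total:
        for rank, relevant in enumerate(line):
            if relevant:
                total += 1 / (rank + 1)
                break
    return total / len(relevance_total)
```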
Links:
- notebook
- results-gpt4o.csv (answers from GPT-4o)
- results-gpt35.csv (answers from GPT-3.5-Turbo)
Content
- A->Q->A' cosine similarity
- Evaluating gpt-4o
- Evaluating gpt-3.5-turbo
- Evaluating gpt-4o-mini
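A minimal sketch of the A->Q->A' cosine similarity idea listed above: embed the original answer (A) and the LLM-generated answer (A') and compare the two vectors. The model name and the record keys are illustrative assumptions, not necessarily what the notebook uses.

```python
# Sketch: cosine similarity between the original answer (A) and the generated answer (A').
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # example embedding model

def compute_similarity(record):
    answer_orig = record["answer_orig"]  # A: original answer (assumed column name)
    answer_llm = record["answer_llm"]    # A': LLM-generated answer (assumed column name)
    v_orig = model.encode(answer_orig)
    v_llm = model.encode(answer_llm)
    # cosine similarity = dot product of the L2-normalised vectors
    return float(np.dot(v_orig, v_llm) / (np.linalg.norm(v_orig) * np.linalg.norm(v_llm)))
```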
Links:
- notebook
- results-gpt4o-cosine.csv (answers from GPT-4o with cosine similarity)
- results-gpt35-cosine.csv (answers from GPT-3.5-Turbo with cosine similarity)
- results-gpt4o-mini.csv (answers from GPT-4o-mini)
- results-gpt4o-mini-cosine.csv (answers from GPT-4o-mini with cosine similarity)
- LLM as a judge
- A->Q->A' evaluation
- Q->A evaluation
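Below is a minimal sketch of LLM-as-a-judge for the Q->A direction: ask a model to rate whether the generated answer is relevant to the question. The prompt wording, the relevance labels, and the model name are illustrative assumptions rather than the exact prompt from the notebook.

```python
# Sketch: using an LLM as a judge for Q->A relevance.
import json
from openai import OpenAI

client = OpenAI()

prompt_template = """
You evaluate a RAG system. Given a question and a generated answer, classify the
answer as "RELEVANT", "PARTLY_RELEVANT" or "NON_RELEVANT" and explain briefly.

Question: {question}
Generated answer: {answer_llm}

Reply in JSON with the keys "Relevance" and "Explanation".
""".strip()

def judge(question, answer_llm, model="gpt-4o-mini"):
    prompt = prompt_template.format(question=question, answer_llm=answer_llm)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # assumes the model follows the instruction to reply with JSON
    return json.loads(response.choices[0].message.content)
```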
Links:
- notebook
- evaluations-aqa.csv (A->Q->A' evaluation results)
- evaluations-qa.csv (Q->A evaluation results)
You can see the prompts and the output from Claude here.
Content
- Adding +1 and -1 buttons
- Setting up a PostgreSQL database
- Putting everything in Docker Compose
```bash
pip install pgcli
pgcli -h localhost -U your_username -d course_assistant -W
```
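As a rough illustration of the "+1 and -1 buttons" and PostgreSQL pieces, here is a minimal sketch that writes one feedback record with psycopg2. The table and column names and the connection settings are assumptions for illustration; only the database name matches the pgcli command above.

```python
# Sketch: storing +1 / -1 user feedback in Postgres.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="course_assistant",
    user="your_username",      # assumption
    password="your_password",  # assumption
)

def save_feedback(conversation_id: str, feedback: int) -> None:
    """feedback is +1 (thumbs up) or -1 (thumbs down)."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO feedback (conversation_id, feedback, timestamp) "
            "VALUES (%s, %s, NOW())",  # assumed table/columns
            (conversation_id, feedback),
        )
    conn.commit()
```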
Links:
- adding vector search
- adding OpenAI
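For the "adding vector search" step, a minimal sketch of a knn query against Elasticsearch 8.x is shown below; the retrieved documents would then be passed to the OpenAI call in the next step. The index, field, and model names are assumptions, not necessarily those used in the app.

```python
# Sketch: vector (knn) search against Elasticsearch 8.x.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # example embedding model

def vector_search(query: str, index: str = "course-questions", k: int = 5):
    query_vector = model.encode(query).tolist()
    response = es.search(
        index=index,
        knn={
            "field": "question_text_vector",  # assumed dense_vector field name
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
        },
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```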
Links:
- Setting up Grafana
- Tokens and costs
- QA relevance
- User feedback
- Other metrics
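To illustrate the "Tokens and costs" metric, here is a minimal sketch that counts tokens with tiktoken and estimates a per-request cost. The prices are placeholder assumptions, not current OpenAI pricing, and it requires a tiktoken version that recognizes the model name.

```python
# Sketch: token counting and cost estimation for one request.
import tiktoken

def estimate_cost(prompt: str, completion: str, model: str = "gpt-4o") -> dict:
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))
    completion_tokens = len(enc.encode(completion))
    # placeholder prices per 1000 tokens -- check the provider's pricing page
    price_in, price_out = 0.005, 0.015
    cost = (prompt_tokens * price_in + completion_tokens * price_out) / 1000
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": cost,
    }
```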
Links:
- Grafana variables
- Exporting and importing dashboards
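One way to export a dashboard programmatically is the Grafana HTTP API; the sketch below fetches a dashboard's JSON by its UID, assuming Grafana runs locally and `GRAFANA_TOKEN` holds a service-account or API token. Dashboards can also be exported and imported directly from the Grafana UI, as covered in the video.

```python
# Sketch: exporting a Grafana dashboard as JSON via the HTTP API.
import json
import os
import requests

GRAFANA_URL = "http://localhost:3000"
headers = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

def export_dashboard(uid: str, path: str) -> None:
    resp = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/{uid}", headers=headers)
    resp.raise_for_status()
    with open(path, "w") as f:
        json.dump(resp.json()["dashboard"], f, indent=2)
```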
Links:
See here
https://www.loom.com/share/1dd375ec4b0d458fabdfc2b841089031
- Notes by slavaheroes
- Did you take notes? Add them above this line (Send a PR with links to your notes)