New Evaluation Tooling #677

drazvan · 2024-08-14T12:48:51Z

This PR adds new tooling for evaluating a guardrail configuration.
NOTE: documentation is minimal; still WIP.

Below is a quick overview of the nemoguardrails eval CLI.

Run Evaluations

To run a new evaluation with a guardrail configuration:

nemoguardrails eval run -g <GUARDRAIL_CONFIG_PATH> -o <OUTPUT_PATH>

Check Compliance

To check compliance with the policies, you can use the LLM-as-a-judge method:

nemoguardrails eval check-compliance --llm-judge=<LLM_MODEL_NAME> -o <OUTPUT_PATH>

The judge can be any LLM supported by NeMo Guardrails. For example:

models:
  - type: llm-judge
    engine: openai
    model: gpt-4

  - type: llm-judge
    engine: nvidia_ai_endpoints
    model: meta/llama3-70b-instruct

Review and Analyze

To review and analyze the results, launch the NeMo Guardrails Eval UI:

nemoguardrails eval ui
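The three steps above can also be chained from a script. The following is a minimal sketch, not part of this PR: it only assembles the commands shown in this description (using the `-g`, `-o`, and `--llm-judge` options as written above) and prints them. The helper names `build_eval_commands` and `run_workflow`, the paths `./config` and `./eval_output`, and the use of `gpt-4` as the judge model name are all hypothetical placeholders.

```python
import shlex
import subprocess
from typing import List


def build_eval_commands(
    config_path: str, output_path: str, judge_model: str
) -> List[List[str]]:
    """Return the argv lists for run, check-compliance and ui, in that order.

    Only the options that appear in the PR description are used.
    """
    return [
        ["nemoguardrails", "eval", "run", "-g", config_path, "-o", output_path],
        [
            "nemoguardrails",
            "eval",
            "check-compliance",
            f"--llm-judge={judge_model}",
            "-o",
            output_path,
        ],
        ["nemoguardrails", "eval", "ui"],
    ]


def run_workflow(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it as well when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            # The final ui step is expected to block until it is stopped.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder values; replace with a real config path, output path and judge.
    cmds = build_eval_commands("./config", "./eval_output", "gpt-4")
    run_workflow(cmds, dry_run=True)
```

With `dry_run=True` nothing is executed, so the script can be used to confirm the command lines before starting a real evaluation.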


drazvan added 29 commits July 29, 2024 11:00

Fix wrong import.

3a0813c

Refactor nemoguardrails.eval into nemoguardrails.evaluate.

4043caf

Fix a small bug related to a pydantic validator.

4972bf2

Fix typo in PyPI project description.

237f823

Refactor LLM initialization code.

bf0f333

Fix issue with the TOKENIZERS_PARALLELISM warning.

d7545c5

Fix small bug in the config validation logic.

f9a3857

Add the first version of nemoguardrails eval run/review/summary.

2e36d41

Add test for loading an eval config.

0bc6d3d

Start docs for nemoguardrails eval.

5e6b637

Change default output format to JSON and optimize YAML parsing.

9fd6227

Merged review and summary into a single ui command. README. Other improvements.

c7cbee8

Add Config page and a few other improvements.

36015cc

Fix bug with running compliance checks.

11b673f

Fix small issue with detecting output paths.

9d29a44

Multiple fixes, support for parallelism, evaluation context, custom prompt per policy.

fd8ea0f

Add support for parallelism when running the interactions.

beadd8d

Fix issues with reviewing the interactions. Allow reloading from disk.

13f3861

A couple of small fixes for the Review interface.

18db046

Small tweak to the latencies table in summary.

11009b0

Add compliance inconsistencies filter.

19b2139

Eval docs refactoring.

701483e

Fix how the LLM model name is detected and recorded during the evaluation.

09b70a0

Update the charts for the Resource Usage part.

e4f5bed

Update to the latency reports. Short and detailed summary.

63a8ba9

Support for expected latencies.

50d5941

Include all expected outputs in the prompts and support for `expected_output_for_policy`.

0f6e488

A few final tweaks.

d4b006a

Documentation tweaks and sample_abc config.

7d2648e

drazvan marked this pull request as draft August 14, 2024 12:49

drazvan and others added 4 commits August 14, 2024 16:01

Add the missing streamlit dependency to pyproject.toml.

18fa333

Updated the evaluation methodology documentation

e1a9619

Tweaks to the new evaluation docs.

6317826

Merge branch 'develop' into feature/eval-tooling

24e2a0d

drazvan marked this pull request as ready for review August 20, 2024 12:57

drazvan merged commit d25e379 into develop Aug 20, 2024
4 checks passed

emmanuel-ferdman mentioned this pull request Sep 15, 2024

Update evaluate directory reference #751

Merged
