docs(app): AGE-1228 changelog 0.27.0 #2220

Merged · 4 commits · Nov 12, 2024
63 changes: 62 additions & 1 deletion docs/blog/main.mdx
@@ -9,6 +9,62 @@ import Image from "@theme/IdealImage";

<section class="changelog">

### Observability and Prompt Management

_6 November 2024_

**v0.27.0**

<Image
style={{
display: "block",
margin: "5px auto",
width: "80%",
textAlign: "center",
}}
img={require("/images/observability/observability.png")}
alt="Observability view showing an open trace for an OpenAI application"
loading="lazy"
/>
This release is one of our biggest yet—one changelog hardly does it justice.

**First up: Observability**

We’ve had observability in beta for a while, but now it’s been completely rewritten,
with a brand-new UI and fully **open-source code**.

The new [Observability SDK](/observability/overview) is compatible with [OpenTelemetry (Otel)](https://opentelemetry.io/) and [gen-ai semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/). This means you get a lot of integrations right out of the box, like [LangChain](/observability/integrations/langchain), [OpenAI](/observability/integrations/openai), and more.

We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:

- A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.

- The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.

- The SDK is Otel-compatible, and we’ve already tested integrations for [OpenAI](/observability/integrations/openai), [LangChain](/observability/integrations/langchain), [LiteLLM](/observability/integrations/litellm), and [Instructor](/observability/integrations/instructor), with guides available for each. In most cases, adding a few lines of code will have you seeing traces directly in Agenta.
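
As a rough illustration of what "a few lines of code" means for the OpenAI integration, here is a hedged sketch. The `ag.init()` and `@ag.instrument()` entry points are assumptions for illustration; the linked integration guide has the authoritative snippet.

```python
# Sketch of tracing an OpenAI call with Agenta's Otel-compatible SDK.
# ag.init() and @ag.instrument() are assumed entry points; see the
# OpenAI integration guide for the exact setup.
import agenta as ag
from openai import OpenAI

ag.init()  # assumed: configures the exporter that ships spans to Agenta

client = OpenAI()

@ag.instrument()  # assumed: wraps the function in a trace span
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(ask("What is the capital of France?"))
```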

**Next: Prompt Management**

We’ve completely rewritten the [prompt management SDK](/prompt-management/overview), giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this [here](/tutorials/sdk/manage-prompts-with-SDK).
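
To make the workflow concrete, here is a hypothetical sketch of what full CRUD on a prompt configuration could look like. The function names (`create_version`, `list_versions`, `deploy`, `delete_version`) are illustrative placeholders, not the SDK's actual API; see the linked tutorial for the real calls.

```python
# Hypothetical sketch of a prompt CRUD workflow; all ag.* calls below are
# illustrative placeholders, not the SDK's actual API.
import agenta as ag  # assumed package name

ag.init()

# Create a new prompt/configuration version
config = {"model": "gpt-4", "prompt": "You are an expert on {country}."}
new_version = ag.create_version(app="country-expert", config=config)  # placeholder

# Read the version history
for version in ag.list_versions(app="country-expert"):  # placeholder
    print(version.name, version.created_at)

# Deploy the new version to an environment, then delete an old one
ag.deploy(app="country-expert", version=new_version, environment="production")  # placeholder
ag.delete_version(app="country-expert", version="v1")  # placeholder
```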

**And finally: LLM-as-a-Judge Overhaul**

We’ve made significant upgrades to the [LLM-as-a-Judge evaluator](/evaluation/evaluators/llm-as-a-judge). It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we’re seeing better results with it.

<Image
style={{
display: "block",
margin: "5px auto",
width: "50%",
textAlign: "center",
}}
img={require("/images/evaluation/llm-as-a-judge.gif")}
alt="Configuring the LLM-as-a-Judge evaluator"
loading="lazy"
/>

---

### New Application Management View and Various Improvements

_22 October 2024_
@@ -56,7 +112,12 @@ _22 August 2024_
<Image
img={require("/images/changelog/new_ui.png")}
alt="Button for exporting evaluation results"
style={{
display: "block",
margin: "5px auto",
width: "80%",
textAlign: "center",
}}
/>
</div>

47 changes: 33 additions & 14 deletions docs/docs/evaluation/evaluators/04-llm-as-a-judge.mdx
@@ -6,26 +6,45 @@ LLM-as-a-Judge is an evaluator that uses an LLM to assess LLM outputs. It's part

![Configuration of LLM-as-a-judge](/images/evaluation/llm-as-a-judge.png)

The evaluator has the following parameters:

#### The Prompt

You can configure the prompt used for evaluation. The prompt can contain multiple messages in OpenAI format (role/content). All messages in the prompt have access to the inputs, outputs, and reference answers (any column in the test set). To reference these in your prompts, use the following variables:

- `correct_answer`: the column with the reference answer in the test set. You can configure the name of this column under `Advanced Setting` in the configuration modal.
- `prediction`: the output of the LLM application
- `$input_column_name`: the value of any input column for the given row of your test set

Here's the default prompt used for the country expert demo application (note that it uses the `country` input column from our test set):

**System prompt:**

```
You are an evaluator grading an LLM App.
You will be given INPUTS, the LLM APP OUTPUT, the CORRECT ANSWER, and the PROMPT used in the LLM APP.
Here are the grading criteria to follow:
- Ensure that the LLM APP OUTPUT has the same meaning as the CORRECT ANSWER

SCORE:
- The score should be between 0 and 10
- A score of 10 means that the answer is perfect. This is the highest (best) score.
- A score of 0 means that the answer does not meet any of the criteria. This is the lowest possible score you can give.

ANSWER ONLY THE SCORE. DO NOT USE MARKDOWN. DO NOT PROVIDE ANYTHING OTHER THAN THE NUMBER
```

**User prompt:**

```

INPUTS:
country: {country}
CORRECT ANSWER: {correct_answer}
LLM APP OUTPUT: {prediction}

```
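
To see how these variables are filled in, here is a minimal sketch that renders the user prompt above for one test-set row using plain Python string formatting (independent of the Agenta SDK; the row values are made up for illustration).

```python
# Minimal sketch: rendering the user prompt for one test-set row.
# The template mirrors the default user prompt shown above.
user_prompt_template = (
    "INPUTS:\n"
    "country: {country}\n"
    "CORRECT ANSWER: {correct_answer}\n"
    "LLM APP OUTPUT: {prediction}"
)

row = {
    "country": "France",                  # input column from the test set
    "correct_answer": "Paris",            # reference answer column
    "prediction": "The capital is Paris"  # output of the LLM application
}

print(user_prompt_template.format(**row))
```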

### The Model

You can configure the model by selecting one of the supported options (`gpt-3.5-turbo`, `gpt-4`, `claude-3-5-sonnet`, `claude-3-5-haiku`, `claude-3-5-opus`). To use LLM-as-a-Judge, you'll need to set your OpenAI or Anthropic API key in the settings. The key is saved locally and only sent to our servers for evaluation; it's not stored there.
Binary file added docs/static/images/evaluation/llm-as-a-judge.gif
Binary file modified docs/static/images/evaluation/llm-as-a-judge.png