docs(app): AGE-1228 changelog 0.27.0 #2220

Merged · 4 commits · Nov 12, 2024
63 changes: 62 additions & 1 deletion docs/blog/main.mdx
@@ -9,6 +9,62 @@ import Image from "@theme/IdealImage";

<section class="changelog">

### Observability and Prompt Management

_6 November 2024_

**v0.27.0**

<Image
style={{
display: "block",
margin: "5px auto",
width: "80%",
textAlign: "center",
}}
img={require("/images/observability/observability.png")}
alt="Observability view showing an open trace for an OpenAI application"
loading="lazy"
/>
This release is one of our biggest yet—one changelog hardly does it justice.

**First up: Observability**

We’ve had observability in beta for a while, but now it’s been completely rewritten,
with a brand-new UI and fully **open-source code**.

The new [Observability SDK](/observability/overview) is compatible with [OpenTelemetry (Otel)](https://opentelemetry.io/) and [gen-ai semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/). This means you get a lot of integrations right out of the box, like [LangChain](/observability/integrations/langchain), [OpenAI](/observability/integrations/openai), and more.

We’ll publish a full blog post soon, but here’s a quick look at what the new observability offers:

- A redesigned UI that lets you visualize nested traces, making it easier to understand what’s happening behind the scenes.

- The web UI lets you filter traces by name, cost, and other attributes—you can even search through them easily.

- The SDK is Otel-compatible, and we’ve already tested integrations for [OpenAI](/observability/integrations/openai), [LangChain](/observability/integrations/langchain), [LiteLLM](/observability/integrations/litellm), and [Instructor](/observability/integrations/instructor), with guides available for each. In most cases, adding a few lines of code will have you seeing traces directly in Agenta.
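
As a rough illustration of what "a few lines of code" means for the OpenAI integration, here is a hedged sketch. The `ag.init()` and `@ag.instrument()` entry points are assumptions for illustration; the linked integration guide has the authoritative snippet.

```python
# Sketch of tracing an OpenAI call with Agenta's Otel-compatible SDK.
# ag.init() and @ag.instrument() are assumed entry points; see the
# OpenAI integration guide for the exact setup.
import agenta as ag
from openai import OpenAI

ag.init()  # assumed: configures the exporter that ships spans to Agenta

client = OpenAI()

@ag.instrument()  # assumed: wraps the function in a trace span
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(ask("What is the capital of France?"))
```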

**Next: Prompt Management**

We’ve completely rewritten the [prompt management SDK](/prompt-management/overview), giving you full CRUD capabilities for prompts and configurations. This includes creating, updating, reading history, deploying new versions, and deleting old ones. You can find a first tutorial for this [here](/tutorials/sdk/manage-prompts-with-SDK).
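
To make the workflow concrete, here is a hypothetical sketch of what full CRUD on a prompt configuration could look like. The function names (`create_version`, `list_versions`, `deploy`, `delete_version`) are illustrative placeholders, not the SDK's actual API; see the linked tutorial for the real calls.

```python
# Hypothetical sketch of a prompt CRUD workflow; all ag.* calls below are
# illustrative placeholders, not the SDK's actual API.
import agenta as ag  # assumed package name

ag.init()

# Create a new prompt/configuration version
config = {"model": "gpt-4", "prompt": "You are an expert on {country}."}
new_version = ag.create_version(app="country-expert", config=config)  # placeholder

# Read the version history
for version in ag.list_versions(app="country-expert"):  # placeholder
    print(version.name, version.created_at)

# Deploy the new version to an environment, then delete an old one
ag.deploy(app="country-expert", version=new_version, environment="production")  # placeholder
ag.delete_version(app="country-expert", version="v1")  # placeholder
```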

**And finally: LLM-as-a-Judge Overhaul**

We’ve made significant upgrades to the [LLM-as-a-Judge evaluator](/evaluation/evaluators/llm-as-a-judge). It now supports prompts with multiple messages and has access to all variables in a test case. You can also switch models (currently supporting OpenAI and Anthropic). These changes make the evaluator much more flexible, and we’re seeing better results with it.

<Image
style={{
display: "block",
margin: "5px auto",
width: "50%",
textAlign: "center",
}}
img={require("/images/evaluation/llm-as-a-judge.gif")}
alt="Configuring the LLM-as-a-Judge evaluator"
loading="lazy"
/>

---

### New Application Management View and Various Improvements

_22 October 2024_
@@ -56,7 +112,12 @@ _22 August 2024_
<Image
img={require("/images/changelog/new_ui.png")}
alt="Button for exporting evaluation results"
style={{
display: "block",
margin: "5px auto",
width: "80%",
textAlign: "center",
}}
/>
</div>

47 changes: 33 additions & 14 deletions docs/docs/evaluation/evaluators/04-llm-as-a-judge.mdx
@@ -6,26 +6,45 @@ LLM-as-a-Judge is an evaluator that uses an LLM to assess LLM outputs. It's part

![Configuration of LLM-as-a-judge](/images/evaluation/llm-as-a-judge.png)

The evaluator has the following parameters:

#### The Prompt

You can configure the prompt used for evaluation. The prompt can contain multiple messages in OpenAI format (role/content). All messages in the prompt have access to the inputs, outputs, and reference answers (any column in the test set). To reference these in your prompts, use the following variables:

- `correct_answer`: the column with the reference answer in the test set. You can configure the name of this column under `Advanced Setting` in the configuration modal.
- `prediction`: the output of the LLM application
- `$input_column_name`: the value of any input column for the given row of your test set

Here's the default prompt used for the country expert demo application (note that it uses the `country` input column from our test set):

**System prompt:**

```
You are an evaluator grading an LLM App.
You will be given INPUTS, the LLM APP OUTPUT, the CORRECT ANSWER, and the PROMPT used in the LLM APP.
Here are the grading criteria to follow:
- Ensure that the LLM APP OUTPUT has the same meaning as the CORRECT ANSWER

SCORE:
- The score should be between 0 and 10
- A score of 10 means that the answer is perfect. This is the highest (best) score.
- A score of 0 means that the answer does not meet any of the criteria. This is the lowest possible score you can give.

ANSWER ONLY THE SCORE. DO NOT USE MARKDOWN. DO NOT PROVIDE ANYTHING OTHER THAN THE NUMBER
```

**User prompt:**

```

INPUTS:
country: {country}
CORRECT ANSWER: {correct_answer}
LLM APP OUTPUT: {prediction}

```
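
To see how these variables are filled in, here is a minimal sketch that renders the user prompt above for one test-set row using plain Python string formatting (independent of the Agenta SDK; the row values are made up for illustration).

```python
# Minimal sketch: rendering the user prompt for one test-set row.
# The template mirrors the default user prompt shown above.
user_prompt_template = (
    "INPUTS:\n"
    "country: {country}\n"
    "CORRECT ANSWER: {correct_answer}\n"
    "LLM APP OUTPUT: {prediction}"
)

row = {
    "country": "France",                  # input column from the test set
    "correct_answer": "Paris",            # reference answer column
    "prediction": "The capital is Paris"  # output of the LLM application
}

print(user_prompt_template.format(**row))
```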

### The Model

You can configure the model by selecting one of the supported options (`gpt-3.5-turbo`, `gpt-4`, `claude-3-5-sonnet`, `claude-3-5-haiku`, `claude-3-5-opus`). To use LLM-as-a-Judge, you'll need to set your OpenAI or Anthropic API key in the settings. The key is saved locally and only sent to our servers for evaluation; it's not stored there.
Binary file added docs/static/images/evaluation/llm-as-a-judge.gif
Binary file modified docs/static/images/evaluation/llm-as-a-judge.png