diff --git a/README.md b/README.md
index 8a77950df..9b0e81d6c 100644
--- a/README.md
+++ b/README.md
@@ -18,14 +18,14 @@ The key features of the Intelligence Layer are:
 Not sure where to start? Familiarize yourself with the Intelligence Layer using the below notebooks.
 
-| Order | Task | Description | Notebook 📓 |
-| ----- | ------------------------------ | --------------------------------------- | ------------------------------------------------------------------------------- |
-| 1 | Summarization | Summarize a document | [summarize.ipynb](./src/examples/summarize.ipynb) |
-| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/examples/qa.ipynb) |
-| 3 | Quickstart task | Build a custom task for your use case | [quickstart_task.ipynb](./src/examples/quickstart_task.ipynb) |
-| 4 | Single label Classification | Conduct zero-shot text classification | [single_label_classify.ipynb](./src/examples/single_label_classify.ipynb) |
-| 5 | Embedding based Classification | Classify texts on the basis of examples | [embedding_based_classify.ipynb](./src/examples/embedding_based_classify.ipynb) |
-| 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/examples/document_index.ipynb) |
+| Order | Task | Description | Notebook 📓 |
+| ----- | ------------------ | ----------------------------------------- | ------------------------------------------------------------- |
+| 1 | Summarization | Summarize a document | [summarize.ipynb](./src/examples/summarize.ipynb) |
+| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/examples/qa.ipynb) |
+| 3 | Quickstart task | Build a custom task for your use case | [quickstart_task.ipynb](./src/examples/quickstart_task.ipynb) |
+| 4 | Classification | Learn about two methods of classification | [classification.ipynb](./src/examples/classification.ipynb) |
+| 5 | Evaluation | Evaluate LLM-based methodologies | [evaluation.ipynb](./src/examples/evaluation.ipynb) |
+| 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/examples/document_index.ipynb) |
 
 ## Getting started with the Jupyter Notebooks
 
diff --git a/src/examples/classification.ipynb b/src/examples/classification.ipynb
new file mode 100644
index 000000000..91b016597
--- /dev/null
+++ b/src/examples/classification.ipynb
@@ -0,0 +1,454 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Classification\n",
+ "\n",
+ "Language models offer unprecedented capabilities in understanding and generating human-like text.\n",
+ "One of the pressing issues in their application is the classification of vast amounts of data.\n",
+ "Traditional methods often require manual labeling and can be time-consuming and prone to errors.\n",
+ "LLMs, on the other hand, can swiftly process and categorize enormous datasets with minimal human intervention.\n",
+ "By leveraging LLMs for classification tasks, organizations can unlock insights from their data more efficiently, streamline their workflows, and harness the full potential of their information assets.\n",
+ "\n",
+ "In this notebook, we present two alternative ways of classifying text using Aleph Alpha's Luminous models.\n",
+ "First, let's have a look at single-label classification using prompting.\n",
+ "\n",
+ "### Prompt-based single-label classification\n",
+ "\n",
+ "Single-label classification refers to the task of categorizing data points into one of n distinct categories or classes.\n",
+ "In this type of classification, 
each input is assigned to only one class, ensuring that no overlap exists between categories.\n",
+ "Common applications of single-label classification include email spam detection, where emails are classified as either \"spam\" or \"not spam\", and sentiment classification, where a text can be \"positive\", \"negative\" or \"neutral\".\n",
+ "When solving such a task in a prompt-based manner, the primary goal is to construct a prompt that instructs the model to accurately predict the correct class for any given input.\n",
+ "\n",
+ "### When should you use prompt-based classification?\n",
+ "\n",
+ "We recommend using this type of classification when...\n",
+ "- ...the labels are easily understood (they don't require explanation or examples).\n",
+ "- ...the labels cannot be recognized purely by their semantic meaning.\n",
+ "- ...many examples for each label aren't readily available.\n",
+ "\n",
+ "### Example snippet\n",
+ "\n",
+ "Running the following code will instantiate a `SingleLabelClassify` task that leverages a prompt for classification.\n",
+ "We can then pass it any `ClassifyInput`, and the task returns each label along with its probability.\n",
+ "In addition, note the `debug_log`, which gives a comprehensive overview of the result.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from os import getenv\n",
+ "\n",
+ "from aleph_alpha_client import Client\n",
+ "\n",
+ "from intelligence_layer.use_cases.classify.single_label_classify import ClassifyInput, SingleLabelClassify\n",
+ "from intelligence_layer.core.task import Chunk\n",
+ "from intelligence_layer.core.logger import InMemoryDebugLogger\n",
+ "\n",
+ "text_to_classify = Chunk(\"In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \\n\\\n",
+ "With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. 
\n\\\n",
+ "As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.\")\n",
+ "labels = [\"happy\", \"angry\", \"sad\"]\n",
+ "client = Client(getenv(\"AA_TOKEN\"))\n",
+ "task = SingleLabelClassify(client)\n",
+ "input = ClassifyInput(\n",
+ "    chunk=text_to_classify,\n",
+ "    labels=labels\n",
+ ")\n",
+ "\n",
+ "debug_log = InMemoryDebugLogger(name=\"classify\")\n",
+ "output = task.run(input, debug_log)\n",
+ "for label, score in output.scores.items():\n",
+ "    print(f\"{label}: {round(score, 4)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### How does this implementation work?\n",
+ "\n",
+ "We prompt the model multiple times, supplying the text (or chunk) together with one label at a time.\n",
+ "Note that we also supply each label, rather than letting the model generate it.\n",
+ "\n",
+ "To further explain this, let's start with a more familiar case.\n",
+ "Intuitively, one would probably prompt a model like so:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from aleph_alpha_client import PromptTemplate\n",
+ "\n",
+ "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n",
+ "print(prompt_template.to_prompt(text=text_to_classify, label=\"\").items[0].text)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The model would then complete our instruction, thus generating a matching label.\n",
+ "\n",
+ "In the case of single-label classification, however, we already know all possible classes beforehand.\n",
+ "Because of this, all we are interested in is the probability that the model would have generated our specific classes.\n",
+ "To get this probability, we can prompt the model with each of our classes and ask it to return the \"logprobs\" for the text.\n",
+ "\n",
+ "In the case of prompt-based classification, the base prompt looks something like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n",
+ "print(prompt_template.to_prompt(text=text_to_classify, label=\" \" + labels[0]).items[0].text)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As you can see, we have the same prompt, but with a potential label candidate already filled in.\n",
+ "Now, we will ask the model to evaluate the likelihood of this label, i.e. of this completion.\n",
+ "\n",
+ "Our request will not generate any tokens, but instead return the log probability of this completion given the previous tokens.\n",
+ "This is called an `EchoTask`.\n",
+ "Let's have a look at just one of these tasks triggered by our classification run.\n",
+ "\n",
+ "In particular, note the `expected_completion` in the `Input` and the `prob` for the token \" angry\" in the `Output`.\n",
+ "Feel free to ignore the big `Complete` task dump in the middle."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "debug_log.logs[-1].logs[0].logs[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now that we have the logprobs, we just need to do some calculations to turn them into a final score.\n",
+ "\n",
+ "To turn the logprobs into our end scores, we first normalize our probabilities.\n",
+ "For this, we utilize a probability tree.\n",
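+ "\n",
+ "To make this concrete, here is a toy sketch of that normalization step with made-up logprobs for single-token labels; the actual implementation performs this per token level inside the tree:\n",
+ "\n",
+ "``` python\n",
+ "import math\n",
+ "\n",
+ "# Hypothetical logprobs for the completion token of each label,\n",
+ "# as an echo request might return them (made-up numbers).\n",
+ "logprobs = {' happy': -0.8, ' angry': -2.5, ' sad': -1.9}\n",
+ "\n",
+ "# Convert the logprobs back into probabilities...\n",
+ "probs = {label: math.exp(lp) for label, lp in logprobs.items()}\n",
+ "\n",
+ "# ...and normalize over the label candidates so that they sum to 1.\n",
+ "total = sum(probs.values())\n",
+ "normalized = {label: p / total for label, p in probs.items()}\n",
+ "print(normalized)  # ' happy' ends up with the highest score\n",
+ "```"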
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intelligence_layer.use_cases.classify.single_label_classify import TreeNode\n",
+ "from intelligence_layer.core.logger import LogEntry\n",
+ "\n",
+ "task_log = debug_log.logs[-1]\n",
+ "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n",
+ "log = normalized_probs_logs[-1]\n",
+ "\n",
+ "root = TreeNode()\n",
+ "for probs in log.values():\n",
+ "    root.insert_without_calculation(probs)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Finally, we take the product of all the paths to get the following results:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for label, score in output.scores.items():\n",
+ "    print(f\"{label}: {round(score, 5)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The example above is rather straightforward, but in some situations a label doesn't map to just a single token.\n",
+ "\n",
+ "What if we take some labels that have overlapping tokens?\n",
+ "This makes the calculation a bit more complicated:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intelligence_layer.use_cases.classify.single_label_classify import SingleLabelClassify, ClassifyInput\n",
+ "from intelligence_layer.core.logger import LogEntry\n",
+ "\n",
+ "\n",
+ "labels = [\"Space party\", \"Space exploration\", \"Space exploration party\"]\n",
+ "task = SingleLabelClassify(client)\n",
+ "input = ClassifyInput(\n",
+ "    chunk=text_to_classify,\n",
+ "    labels=labels\n",
+ ")\n",
+ "logger = InMemoryDebugLogger(name=\"classify\")\n",
+ "output = task.run(input, logger)\n",
+ "task_log = logger.logs[-1]\n",
+ "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n",
+ "log = normalized_probs_logs.pop()\n",
+ "\n",
+ "root = TreeNode()\n",
+ "for probs in log.values():\n",
+ "    root.insert_without_calculation(probs)\n",
+ "\n",
+ "print(\"End scores:\")\n",
+ "for label, score in output.scores.items():\n",
+ "    print(f\"{label}: {round(score, 4)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here, the three classes have some overlapping tokens, namely \"Space\" and \"exploration\".\n",
+ "\"party\" is not overlapping, because it occurs in two different places (after \"Space\" and after \"exploration\").\n",
+ "\n",
+ "To illustrate how the tree handles such shared prefixes, consider the following toy sketch, which uses made-up numbers rather than actual model output:\n",
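+ "\n",
+ "``` python\n",
+ "import math\n",
+ "\n",
+ "# Made-up logprobs for the sibling tokens that can follow the shared\n",
+ "# prefix ' Space' in the tree (purely illustrative numbers).\n",
+ "siblings = {' party': -2.0, ' exploration': -1.0}\n",
+ "\n",
+ "# Normalize among the siblings at this level of the tree...\n",
+ "probs = {token: math.exp(lp) for token, lp in siblings.items()}\n",
+ "total = sum(probs.values())\n",
+ "level_probs = {token: p / total for token, p in probs.items()}\n",
+ "\n",
+ "# ...then a label's end score is the product of the normalized\n",
+ "# probabilities along its path. ' Space' is the only candidate at the\n",
+ "# root, so its normalized probability is 1.0.\n",
+ "score_space_party = 1.0 * level_probs[' party']\n",
+ "print(level_probs, score_space_party)\n",
+ "```\n",
+ "\n",
+ "Cool, so now we have figured out how prompt-based classification works.\n",
+ "Let's have a look at another classification use-case!"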
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Embedding-based multi-label classification\n",
+ "\n",
+ "Large language model embeddings offer a powerful approach to text classification.\n",
+ "In particular, such embeddings can be seen as a numerical representation of the meaning of a text.\n",
+ "Utilizing this, we can provide textual examples for each label and embed them to create a representation of each label in vector space.\n",
+ "\n",
+ "**Or, in more detail**:\n",
+ "In this method, each example from various classes is transformed into a vector representation using the embeddings from the language model.\n",
+ "These embedded vectors capture the semantic essence of the text.\n",
+ "Once this is done, clusters of embeddings are formed for each class, representing the centroid or the average meaning of the examples within that class.\n",
+ "When a new piece of text needs to be classified, it is first embedded using the same language model.\n",
+ "This new embedded vector is then compared to the pre-defined clusters for each class using cosine similarity.\n",
+ "The class whose cluster is closest to the new text's embedding is then assigned to the text, thereby achieving classification.\n",
+ "This method leverages the deep semantic understanding of large language models to classify texts with high accuracy and nuance.\n",
+ "\n",
+ "### When should you use embedding-based classification?\n",
+ "\n",
+ "We recommend using this type of classification when...\n",
+ "- ...proper classification requires fine-grained control over the classes' definitions.\n",
+ "- ...the labels can be defined mostly or purely by the semantic meaning of the examples.\n",
+ "- ...examples for each label are readily available.\n",
+ "\n",
+ "### Example snippet\n",
+ "\n",
+ "Let's start by instantiating a classifier for sentiment classification."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify, LabelWithExamples\n",
+ "\n",
+ "\n",
+ "labels_with_examples = [\n",
+ "    LabelWithExamples(\n",
+ "        name=\"positive\",\n",
+ "        examples=[\n",
+ "            \"I really like this.\",\n",
+ "            \"Wow, your hair looks great!\",\n",
+ "            \"We're so in love.\",\n",
+ "            \"That truly was the best day of my life!\",\n",
+ "            \"What a great movie.\"\n",
+ "        ],\n",
+ "    ),\n",
+ "    LabelWithExamples(\n",
+ "        name=\"negative\",\n",
+ "        examples=[\n",
+ "            \"I really dislike this.\",\n",
+ "            \"Ugh, your hair looks horrible!\",\n",
+ "            \"We're not in love anymore.\",\n",
+ "            \"My day was very bad, I did not have a good time.\",\n",
+ "            \"They make terrible food.\"\n",
+ "        ],\n",
+ "    ),\n",
+ "]\n",
+ "classify = EmbeddingBasedClassify(labels_with_examples, client)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are several things to note here, in particular:\n",
+ "- This time, we instantiated our classification task with a number of `LabelWithExamples`.\n",
+ "- The examples provided should reflect the spectrum of texts expected in the intended usage domain of this classifier.\n",
+ "- This cell took some time to run.\n",
+ "This is because we instantiate a retriever in the background, which also requires us to embed the provided examples.\n",
+ "\n",
+ "Before running the classifier on an unknown example, here is a minimal sketch of the centroid-and-cosine-similarity idea described above; it uses toy vectors and is a simplification, not the actual `EmbeddingBasedClassify` implementation:\n",
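+ "\n",
+ "``` python\n",
+ "import math\n",
+ "\n",
+ "def cosine_similarity(a, b):\n",
+ "    dot = sum(x * y for x, y in zip(a, b))\n",
+ "    norm_a = math.sqrt(sum(x * x for x in a))\n",
+ "    norm_b = math.sqrt(sum(x * x for x in b))\n",
+ "    return dot / (norm_a * norm_b)\n",
+ "\n",
+ "# Toy 2-dimensional 'embeddings'; in reality these come from a\n",
+ "# semantic embedding model and have many more dimensions.\n",
+ "label_centroids = {\n",
+ "    'positive': [0.9, 0.1],  # average of the embedded positive examples\n",
+ "    'negative': [0.1, 0.9],  # average of the embedded negative examples\n",
+ "}\n",
+ "new_text_embedding = [0.2, 0.8]  # embedding of the text to classify\n",
+ "\n",
+ "scores = {label: cosine_similarity(new_text_embedding, centroid) for label, centroid in label_centroids.items()}\n",
+ "print(scores)  # 'negative' scores highest\n",
+ "```\n",
+ "\n",
+ "With that being said, let's run an unknown example!"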
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "classify_input = ClassifyInput(\n",
+ "    chunk=\"It was very awkward with him, I did not enjoy it.\",\n",
+ "    labels=frozenset(l.name for l in labels_with_examples)\n",
+ ")\n",
+ "logger = InMemoryDebugLogger(name=\"Classify\")\n",
+ "result = classify.run(classify_input, logger)\n",
+ "result"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Nice, we correctly identified the new example.\n",
+ "\n",
+ "Again, note how this result differs from `SingleLabelClassify`'s result:\n",
+ "- The probabilities do not add up to 1.\n",
+ "In fact, we have no way of predicting what the sum of all scores will be.\n",
+ "In some cases, individual scores may even be negative.\n",
+ "All we know is that the highest score is likely to correspond to the best-fitting label, provided we delivered good examples.\n",
+ "- We were much quicker to obtain a result.\n",
+ "\n",
+ "Because all examples are pre-embedded, this classifier is much cheaper to operate, as it only requires a single embedding task to be sent to the Aleph Alpha API.\n",
+ "\n",
+ "Let's try another example. This time, we expect the outcome to be positive.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "classify_input = ClassifyInput(\n",
+ "    chunk=\"We used to be not like each other, but this changed a lot.\",\n",
+ "    labels=frozenset(l.name for l in labels_with_examples)\n",
+ ")\n",
+ "logger = InMemoryDebugLogger(name=\"Classify\")\n",
+ "result = classify.run(classify_input, logger)\n",
+ "result"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Unfortunately, we wrongly classify this text as negative.\n",
+ "To be fair, it is a difficult example.\n",
+ "But no worries, let's simply include this failing example in our list of label examples and try again!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from os import getenv\n",
+ "\n",
+ "from aleph_alpha_client import Client\n",
+ "\n",
+ "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify, LabelWithExamples\n",
+ "\n",
+ "\n",
+ "client = Client(getenv(\"AA_TOKEN\"))\n",
+ "labels_with_examples = [\n",
+ "    LabelWithExamples(\n",
+ "        name=\"positive\",\n",
+ "        examples=[\n",
+ "            \"I really like this.\",\n",
+ "            \"Wow, your hair looks great!\",\n",
+ "            \"We're so in love.\",\n",
+ "            \"That truly was the best day of my life!\",\n",
+ "            \"What a great movie.\",\n",
+ "            \"We used to be not like each other, but this changed a lot.\"  # failing example\n",
+ "        ],\n",
+ "    ),\n",
+ "    LabelWithExamples(\n",
+ "        name=\"negative\",\n",
+ "        examples=[\n",
+ "            \"I really dislike this.\",\n",
+ "            \"Ugh, your hair looks horrible!\",\n",
+ "            \"We're not in love anymore.\",\n",
+ "            \"My day was very bad, I did not have a good time.\",\n",
+ "            \"They make terrible food.\"\n",
+ "        ],\n",
+ "    ),\n",
+ "]\n",
+ "classify = EmbeddingBasedClassify(labels_with_examples, client)\n",
+ "\n",
+ "logger = InMemoryDebugLogger(name=\"Classify\")\n",
+ "result = classify.run(classify_input, logger)\n",
+ "result"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Nice, we now correctly classify this example!\n",
+ "\n",
+ "One advantage of the `EmbeddingBasedClassify` approach is that we can easily tweak our labels by adding new examples.\n",
+ "In essence, this helps ensure that we don't make the same mistake twice.\n",
+ "As we increase the number of examples, the method becomes ever more precise.\n",
+ "\n",
+ "You now have an overview of these two main methods of classification!\n",
+ "Feel free to tweak these methods and play around with their parameters to fine-tune them to your specific use case."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "3.10-intelligence", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/src/examples/embedding_based_classify.ipynb b/src/examples/embedding_based_classify.ipynb deleted file mode 100644 index 14e5b1121..000000000 --- a/src/examples/embedding_based_classify.ipynb +++ /dev/null @@ -1,120 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Embedding-Based Classification\n", - "\n", - "Large language model embeddings offer a powerful approach to text classification.\n", - "In this method, each example from various classes is transformed into a vector representation using the embeddings from the language model.\n", - "These embedded vectors capture the semantic essence of the text.\n", - "Once this is done, clusters of embeddings are formed for each class, representing the centroid or the average meaning of the examples within that class.\n", - "When a new piece of text needs to be classified, it is first embedded using the same language model.\n", - "This new embedded vector is then compared to the pre-defined clusters for each class using a cosine similarity.\n", - "The class whose cluster is closest to the new text's embedding is then assigned to the text, thereby achieving classification.\n", - "This method leverages the deep semantic understanding of large language models to classify texts with high accuracy and nuance.\n", - "\n", - "### When should you use embedding-based classification?\n", - "\n", - "We recommend using this type of classification when...\n", - "- ...proper classification requires fine-grained control over the classes' definitions.\n", - "- ...the labels can be defined mostly or purely by the semantic meaning of the examples.\n", - "- ...examples for each label are readily available.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's start by instantiating a classifier for sentiment classification." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from os import getenv\n", - "\n", - "from aleph_alpha_client import Client\n", - "\n", - "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify, LabelWithExamples\n", - "\n", - "\n", - "client = Client(getenv(\"AA_TOKEN\"))\n", - "labels_with_examples = [\n", - " LabelWithExamples(\n", - " name=\"positive\",\n", - " examples=[\n", - " \"I really like this.\",\n", - " \"Wow, your hair looks great!\",\n", - " \"We're so in love.\",\n", - " \"That truly was the best day of my life!\",\n", - " \"What a great movie.\"\n", - " ],\n", - " ),\n", - " LabelWithExamples(\n", - " name=\"negative\",\n", - " examples=[\n", - " \"I really dislike this.\",\n", - " \"Ugh, Your hair looks horrible!\",\n", - " \"We're not in love anymore.\",\n", - " \"My day was very bad, I did not have a good time.\",\n", - " \"They make terrible food.\"\n", - " ],\n", - " ),\n", - "]\n", - "classify = EmbeddingBasedClassify(labels_with_examples, client)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Alright, let's classify a new example!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.core.logger import InMemoryDebugLogger\n", - "from intelligence_layer.use_cases.classify.classify import ClassifyInput\n", - "\n", - "\n", - "classify_input = ClassifyInput(\n", - " chunk=\"It was very awkward with him, I did not enjoy it.\",\n", - " labels=frozenset(l.name for l in labels_with_examples)\n", - ")\n", - "logger = InMemoryDebugLogger(name=\"Classify\")\n", - "result = classify.run(classify_input, logger)\n", - "result" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "3.10-intelligence", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/src/examples/evaluation.ipynb b/src/examples/evaluation.ipynb new file mode 100644 index 000000000..0b34ea99d --- /dev/null +++ b/src/examples/evaluation.ipynb @@ -0,0 +1,339 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evaluating LLM-based tasks\n", + "\n", + "Evaluating LLM-based use cases is pivotal for several reasons.\n", + "First, with the myriad of methods available, comparability becomes essential.\n", + "By systematically evaluating different approaches, we can discern which techniques are more effective or suited for specific tasks, fostering a deeper understanding of their strengths and weaknesses.\n", + "Secondly, optimization plays a significant role. Without proper evaluation metrics and rigorous testing, it becomes challenging to fine-tune methods and/or models to achieve their maximum potential.\n", + "Moreover, drawing comparisons with state-of-the-art (SOTA) and open-source methods is crucial.\n", + "Such comparisons not only provide benchmarks but also enable users to determine the value-added by proprietary or newer models over freely available counterparts.\n", + "\n", + "However, evaluating LLMs, especially in the domain of text generation, presents unique challenges.\n", + "Text generation is inherently subjective, and what one evaluator deems coherent and relevant, another might find disjointed or off-topic. 
This subjectivity complicates the establishment of universal evaluation standards, making it imperative to approach LLM evaluation with a multifaceted and comprehensive strategy.\n",
+ "\n",
+ "### Evaluating classification use-cases\n",
+ "\n",
+ "To (at least for now) evade the elusive issue described in the last paragraph, let's have a look at an easier-to-evaluate methodology: classification.\n",
+ "Why is this easier?\n",
+ "Well, unlike other tasks such as QA, the result of a classification task is more or less binary (true/false).\n",
+ "There are very few grey areas, as it is unlikely that a classification result is somewhat or \"half\" correct.\n",
+ "\n",
+ "Make sure that you have familiarized yourself with the `SingleLabelClassify` and `EmbeddingBasedClassify` tasks prior to starting this notebook.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "First, we need to instantiate our task and an evaluator for it.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "from aleph_alpha_client import Client\n",
+ "from intelligence_layer.use_cases.classify.classify import ClassifyEvaluator\n",
+ "from intelligence_layer.use_cases.classify.single_label_classify import SingleLabelClassify\n",
+ "\n",
+ "\n",
+ "client = Client(os.getenv(\"AA_TOKEN\"))\n",
+ "task = SingleLabelClassify(client)\n",
+ "evaluator = ClassifyEvaluator(task)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's run a single example and see what comes of it!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intelligence_layer.core.logger import InMemoryDebugLogger\n",
+ "from intelligence_layer.core.task import Chunk\n",
+ "from intelligence_layer.use_cases.classify.classify import ClassifyInput\n",
+ "\n",
+ "\n",
+ "classify_input = ClassifyInput(\n",
+ "    chunk=Chunk(\"This is good\"),\n",
+ "    labels=frozenset({\"positive\", \"negative\"}),\n",
+ ")\n",
+ "evaluation_logger = InMemoryDebugLogger(name=\"Evaluation Logger\")\n",
+ "expected_output = \"positive\"\n",
+ "evaluation = evaluator.evaluate(\n",
+ "    input=classify_input, logger=evaluation_logger, expected_output=[expected_output]\n",
+ ")\n",
+ "\n",
+ "print(\"The task result:\", evaluation.output.scores)\n",
+ "print(\"The expected output:\", expected_output)\n",
+ "print(\"The eval result:\", evaluation.correct)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Cool!\n",
+ "\n",
+ "Let's have a look at this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) for a more elaborate evaluation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "dataset = load_dataset(\"cardiffnlp/tweet_topic_multi\")\n",
+ "test_set_name = \"validation_random\"\n",
+ "all_data = list(dataset[test_set_name])\n",
+ "data, all_data = all_data[:10], all_data[10:]  # this has 573 datapoints, let's take a look at 10 for now\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We need to transform our dataset into the required format.\n",
+ "Therefore, let's check out what it looks like.\n",
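+ "\n",
+ "For reference, a datapoint has roughly the following shape (a hypothetical, abbreviated example; we only rely on the `text` and `label_name` fields below):\n",
+ "\n",
+ "``` python\n",
+ "{\n",
+ "    'text': 'Game day for our local hockey team!',  # hypothetical tweet\n",
+ "    'label_name': ['sports'],  # one or more topic labels\n",
+ "    # ...plus further fields that we don't use here\n",
+ "}\n",
+ "```"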
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data[1]\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Accordingly, this must be translated into the interface of our `Evaluator`.\n",
+ "\n",
+ "This is the target structure:\n",
+ "\n",
+ "``` python\n",
+ "class Example(BaseModel, Generic[Input, ExpectedOutput]):\n",
+ "    input: Input\n",
+ "    expected_output: ExpectedOutput\n",
+ "    ident: Optional[str] = Field(default_factory=lambda: str(uuid4()))\n",
+ "\n",
+ "\n",
+ "class Dataset(BaseModel, Generic[Input, ExpectedOutput]):\n",
+ "    name: str\n",
+ "    examples: Sequence[Example[Input, ExpectedOutput]]\n",
+ "```\n",
+ "\n",
+ "We want the `input` in each `Example` to mimic the input of an actual task; therefore, each one must include the text (chunk) and all possible labels.\n",
+ "The `expected_output` corresponds to whatever we wish to compare our generated output to.\n",
+ "In this case, that means the correct class(es)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intelligence_layer.core.evaluator import Example, Dataset\n",
+ "\n",
+ "\n",
+ "all_labels = list(set(c for d in data for c in d[\"label_name\"]))\n",
+ "dataset = Dataset(\n",
+ "    name=\"tweet topics\",\n",
+ "    examples=[\n",
+ "        Example(\n",
+ "            input=ClassifyInput(\n",
+ "                chunk=Chunk(d[\"text\"]),\n",
+ "                labels=all_labels\n",
+ "            ),\n",
+ "            expected_output=d[\"label_name\"]\n",
+ "        ) for d in data\n",
+ "    ]\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Ok, let's run this!\n",
+ "\n",
+ "Note that this may take a while as we parallelise the tasks in a way that accommodates the inference API."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "evaluation_logger = InMemoryDebugLogger(name=\"Dataset Evaluation Logger\")\n",
+ "result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Checking out the results..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(\"Percentage correct:\", result.percentage_correct)\n",
+ "print(\"First example:\", result.evaluations[0])\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking good!\n",
+ "\n",
+ "Because we designed the `ClassifyEvaluator` in a way that allows it to evaluate any `Task` with `ClassifyInput` and `ClassifyOutput`, it can even evaluate different classifier implementations, such as the `EmbeddingBasedClassify`.\n",
+ "\n",
+ "To achieve this, let's first find some examples for the different labels within our eval set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pprint import pprint\n",
+ "\n",
+ "from intelligence_layer.use_cases.classify.embedding_based_classify import LabelWithExamples\n",
+ "\n",
+ "labels_with_examples_dict: dict[str, list[str]] = {}\n",
+ "for d in all_data:\n",
+ "    for label in d[\"label_name\"]:\n",
+ "        if label in labels_with_examples_dict:\n",
+ "            labels_with_examples_dict[label].append(d[\"text\"])\n",
+ "        else:\n",
+ "            labels_with_examples_dict[label] = [d[\"text\"]]\n",
+ "labels_with_examples = [LabelWithExamples(name=k, examples=v) for k, v in labels_with_examples_dict.items()]\n",
+ "pprint({k: v[:1] for k, v in labels_with_examples_dict.items()})\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Alright, let's instantiate our `EmbeddingBasedClassify` task with these examples.\n",
+ "Again, this may take a few seconds, because we embed all examples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify\n",
+ "\n",
+ "\n",
+ "ebc = EmbeddingBasedClassify(labels_with_examples, client)\n",
+ "ebc_evaluator = ClassifyEvaluator(ebc)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's run!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ebc_evaluation_logger = InMemoryDebugLogger(name=\"Dataset Evaluation Logger 2\")\n",
+ "ebc_result = ebc_evaluator.evaluate_dataset(dataset=dataset, logger=ebc_evaluation_logger)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once again, let's print part of the result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(\"Percentage correct:\", ebc_result.percentage_correct)\n",
+ "print(\"First example:\", ebc_result.evaluations[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As we can see, our `EmbeddingBasedClassify` outperformed the prompt-based approach here.\n",
+ "However, also note the small sample size of 10.\n",
+ "To achieve statistical significance in evaluation, we generally recommend evaluating on at least 100, if not 1000, examples.\n",
+ "\n",
+ "In the case at hand, we can note that the embedding-based approach likely benefitted from the large number of examples we were able to provide on the basis of the extensive dataset.\n",
+ "Generally, we recommend using this approach once you can provide around 10 or more examples per label.\n",
+ "\n",
+ "### Wrap up\n",
+ "\n",
+ "There you go, this is how to evaluate any task using the Intelligence Layer framework.\n",
+ "Simply define an `Evaluator` that takes the target `Task` as input and customize the `evaluate` and `aggregate` methods.\n",
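+ "\n",
+ "As a rough sketch, a custom evaluator could look like the following; the base-class generics, method signatures, and import path are assumptions here, so check `intelligence_layer.core.evaluator` for the actual interface:\n",
+ "\n",
+ "``` python\n",
+ "from typing import Sequence\n",
+ "\n",
+ "from pydantic import BaseModel\n",
+ "\n",
+ "from intelligence_layer.core.evaluator import Evaluator  # assumed import path\n",
+ "\n",
+ "\n",
+ "class MyEvaluation(BaseModel):\n",
+ "    correct: bool\n",
+ "\n",
+ "\n",
+ "class MyAggregatedEvaluation(BaseModel):\n",
+ "    percentage_correct: float\n",
+ "\n",
+ "\n",
+ "class MyEvaluator(Evaluator):\n",
+ "    def __init__(self, task):\n",
+ "        self.task = task\n",
+ "\n",
+ "    def evaluate(self, input, logger, expected_output) -> MyEvaluation:\n",
+ "        # Run the wrapped task and compare its output to the expectation.\n",
+ "        output = self.task.run(input, logger)\n",
+ "        return MyEvaluation(correct=output == expected_output)\n",
+ "\n",
+ "    def aggregate(self, evaluations: Sequence[MyEvaluation]) -> MyAggregatedEvaluation:\n",
+ "        # Collapse per-example evaluations into one aggregate metric.\n",
+ "        correct = sum(int(e.correct) for e in evaluations)\n",
+ "        return MyAggregatedEvaluation(percentage_correct=correct / len(evaluations))\n",
+ "```"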
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "intelligence-layer-tfT-HG2V-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/src/examples/single_label_classify.ipynb b/src/examples/single_label_classify.ipynb deleted file mode 100644 index 4d49d67e5..000000000 --- a/src/examples/single_label_classify.ipynb +++ /dev/null @@ -1,378 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Single Label Classification\n", - "\n", - "Single-label classification, also known as single-class or binary classification, refers to the task of categorizing data points into one of n distinct categories or classes.\n", - "In this type of classification, each input is assigned to only one class, ensuring that no overlap exists between categories.\n", - "Common applications of single-label classification include email spam detection, where emails are classified as either \"spam\" or \"not spam\", or sentiment classification, where a text can be \"positive\", \"negative\" or \"neutral\".\n", - "The primary goal is to train a model that can accurately predict the correct class for any given input based on its features.\n", - "\n", - "### Prompt-based classification\n", - "\n", - "Here, we'll use a purely prompt-based approach for classification.\n", - "\n", - "### When should you use prompt-based classification?\n", - "\n", - "We recommend using this type of classification when...\n", - "- ...the labels are easily understood (they don't require explanation or examples).\n", - "- ...the labels cannot be recognized purely by their semantic meaning.\n", - "- ...many examples for each label aren't readily available.\n", - "\n", - "### Example snippet\n", - "\n", - "Running the following code will instantiate a prompt-based classifier with a debug level for the log.\n", - "Then it will classify the text given in `ClassifyInput`.\n", - "The contents of the `debug_log` will be shown below.\n", - "It gives an overview of the steps taken to get the result.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from os import getenv\n", - "\n", - "from aleph_alpha_client import Client\n", - "\n", - "from intelligence_layer.use_cases.classify.single_label_classify import ClassifyInput, SingleLabelClassify\n", - "from intelligence_layer.core.task import Chunk\n", - "from intelligence_layer.core.logger import InMemoryDebugLogger\n", - "\n", - "text_to_classify = Chunk(\"In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \\n\\\n", - "With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. 
\\n\\\n", - "As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.\")\n", - "labels = [\"happy\", \"angry\", \"sad\"]\n", - "client = Client(getenv(\"AA_TOKEN\"))\n", - "task = SingleLabelClassify(client)\n", - "input = ClassifyInput(\n", - " chunk=text_to_classify,\n", - " labels=labels\n", - ")\n", - "\n", - "debug_log = InMemoryDebugLogger(name=\"classify\")\n", - "output = task.run(input, debug_log)\n", - "for label, score in output.scores.items():\n", - " print(f\"{label}: {round(score, 4)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### How does this implementation work?\n", - "\n", - "For prompt-based classification, we prompt the model multiple times with the text we want to classify and each of our classes.\n", - "Instead of letting the model generate the class it thinks fits the text best, we ask it for the probability for each class.\n", - "\n", - "To further explain this, let's start with a more familiar case.\n", - "Intuitively, one would probably prompt a model like so:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from aleph_alpha_client import PromptTemplate\n", - "\n", - "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n", - "print(prompt_template.to_prompt(text=text_to_classify, label=\"\").items[0].text)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The model would then answer our question and generate a class or label that it thinks fits the text best.\n", - "\n", - "In the case of classification, however, we already know all possible classes beforehand.\n", - "Because of this, all we are interested in is the probability that the model would have generated our specific classes.\n", - "To get this probability, we can prompt the model with each of our classes and ask it to return the \"logprobs\" for the text.\n", - "\n", - "In the case of prompt-based classification, the base prompt looks something like this:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n", - "print(prompt_template.to_prompt(text=text_to_classify, label=\" \" +labels[0]).items[0].text)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As you can see, we have the same prompt, but with a potential label candidate already filled in.\n", - "\n", - "Now, we will ask the model to evaluate the likelihood of this completion.\n", - "\n", - "Our request will now not generate any tokens, but instead return the log probability of this completion given the previous tokens." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have the logprobs, we just need to do some calculations to turn them into a final score.\n", - "\n", - "To turn the logprobs into our end scores, we first normalize our probabilities.\n", - "For this, we utilize a probability tree." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.use_cases.classify.single_label_classify import TreeNode\n", - "from intelligence_layer.core.logger import LogEntry\n", - "\n", - "task_log = debug_log.logs[-1]\n", - "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n", - "log = normalized_probs_logs[-1]\n", - "\n", - "root = TreeNode()\n", - "for probs in log.values():\n", - " root.insert_without_calculation(probs)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we take the product of all the paths to get the following results:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for label, score in output.scores.items():\n", - " print(f\"{label}: {round(score, 5)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example mentioned before is rather straightforward, but there are some situations when it isn't as obvious as a single token.\n", - "\n", - "What if we take some classes that have some overlap?\n", - "In the following example, some of the classes overlap in the tokens they have.\n", - "This makes the calculation a bit more complicated:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.use_cases.classify.single_label_classify import SingleLabelClassify, ClassifyInput\n", - "from intelligence_layer.core.logger import LogEntry\n", - "\n", - "\n", - "labels = [\"Space party\", \"Space exploration\", \"Space exploration party\"]\n", - "task = SingleLabelClassify(client)\n", - "input = ClassifyInput(\n", - " chunk=text_to_classify,\n", - " labels=labels\n", - ")\n", - "logger = InMemoryDebugLogger(name=\"classify\")\n", - "output = task.run(input, logger)\n", - "task_log = logger.logs[-1]\n", - "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n", - "log = normalized_probs_logs.pop()\n", - "\n", - "root = TreeNode()\n", - "for probs in log.values():\n", - " root.insert_without_calculation(probs)\n", - "\n", - "print(\"End scores:\")\n", - "for label, score in output.scores.items():\n", - " print(f\"{label}: {round(score, 4)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here, the three classes have some overlapping tokens, namely \"Space\", and \"exploration\".\n", - "\"party\" is not overlapping, because it occurs in two different places (after \"Space\" and after \"exploration\")." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Cool!\n", - "Now, let's evaluate how well our new methodology is working.\n", - "For this, we will first look for classification datasets to use.\n", - "We found this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) on huggingface, let's see if we can get an evaluation going!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "\n", - "dataset = load_dataset(f\"cardiffnlp/tweet_topic_multi\")\n", - "test_set_name = \"validation_random\"\n", - "data = list(dataset[test_set_name])[:10] # this has 573 datapoints, let's take a look at 20 for now\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we need to instantiate an evaluator that takes our classify methodology (`task`) and some datapoints and returns some evaluation metrics.\n", - "\n", - "First, let's evaluate a single example and see what happens." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.use_cases.classify.classify import ClassifyEvaluator\n", - "\n", - "evaluator = ClassifyEvaluator(task)\n", - "classify_input = ClassifyInput(\n", - " chunk=Chunk(\"This is good\"),\n", - " labels=frozenset({\"positive\", \"negative\"}),\n", - " )\n", - "evaluation_logger = InMemoryDebugLogger(name=\"evaluation logger\")\n", - "expected_output = \"positive\"\n", - "evaluation = evaluator.evaluate(\n", - " input=classify_input, logger=evaluation_logger, expected_output=[expected_output]\n", - ")\n", - "\n", - "print(\"The task result:\", evaluation.output.scores)\n", - "print(\"The expected output:\", expected_output)\n", - "print(\"The eval result:\", evaluation.correct)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We need to transform our dataset into the required format. \n", - "Therefore, let's check out what it looks like." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data[1]\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Accordingly, this must be translated into the interface of our `Evaluator`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.core.evaluator import Example, Dataset\n", - "\n", - "\n", - "all_labels = list(set(c for d in data for c in d[\"label_name\"]))\n", - "dataset = Dataset(\n", - " name=\"tweet topics\",\n", - " examples=[\n", - " Example(\n", - " input=ClassifyInput(\n", - " chunk=d[Chunk(\"text\")],\n", - " labels=all_labels\n", - " ),\n", - " expected_output=d[\"label_name\"]\n", - " ) for d in data\n", - " ]\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, let's run this!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evaluation_logger = InMemoryDebugLogger(name=\"evaluation logger\")\n", - "result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Checking out the results..." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Percentage correct:\", result.percentage_correct)\n", - "print(\"First example:\", result.evaluations[0])\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "3.10-intelligence", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}