docs: how_to_resume_a_run_after_a_crash (#919)

Task: PHS-600
Co-authored-by: FelixFehse <[email protected]>
Aleph-Alpha · Jun 19, 2024 · ae94ea9
1 parent 1237e88
commit ae94ea9
Showing 4 changed files with 131 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,7 @@
    - For `InMemoryRunRepository` based `Runner`s this is limited to runs that failed with an exception that did not crash the whole process/kernel.
    - For `FileRunRepository` based `Runners` even runs that crashed the whole process can be resumed.
    - `DatasetRepository.examples` now accepts an optional parameter `examples_to_skip` to enable skipping of `Example`s with the provided IDs.
+   - Add `how_to_resume_a_run_after_a_crash` notebook.
 
 ### Fixes
 ...

diff --git a/README.md b/README.md
@@ -154,6 +154,7 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
 | [...implement a simple evaluation and aggregation logic](./src/documentation/how_tos/how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb) | Basic examples of evaluation and aggregation logic                         |
 | [...create a dataset](./src/documentation/how_tos/how_to_create_a_dataset.ipynb)                                                                       | Create a dataset used for running a task                                   |
 | [...run a task on a dataset](./src/documentation/how_tos/how_to_run_a_task_on_a_dataset.ipynb)                                                         | Run a task on a whole dataset instead of single examples                   |
+| [...resume a run after a crash](./src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb) | Resume a run after a crash or exception occurred |
 | [...evaluate multiple runs](./src/documentation/how_tos/how_to_evaluate_runs.ipynb)                                                                    | Evaluate (multiple) runs in a single evaluation                            |
 | [...aggregate multiple evaluations](./src/documentation/how_tos/how_to_aggregate_evaluations.ipynb)                                                    | Aggregate (multiple) evaluations in a single aggregation                   |
 | [...retrieve data for analysis](./src/documentation/how_tos/how_to_retrieve_data_for_analysis.ipynb)                                                   | Retrieve experiment data in multiple different ways                        |

diff --git a/src/documentation/how_tos/example_data.py b/src/documentation/how_tos/example_data.py
@@ -34,6 +34,21 @@ def do_run(self, input: str, task_span: TaskSpan) -> str:
         return f"{input} -> output"
 
 
+EXAMPLE_1_INPUT = "input1"
+
+
+class DummyTaskCanFail(Task[str, str]):
+    def __init__(self) -> None:
+        super().__init__()
+        self._raise_exception = True
+
+    def do_run(self, input: str, task_span: TaskSpan) -> str:
+        if input == EXAMPLE_1_INPUT and self._raise_exception:
+            self._raise_exception = False
+            raise Exception("Some random failure in the system.")
+        return f"{input} -> output"
+
+
 class DummyEvaluation(BaseModel):
     eval: str
 
@@ -102,7 +117,9 @@ class ExampleData:
 def example_data() -> ExampleData:
     examples = [
         DummyExample(input="input0", expected_output="expected_output0", data="data0"),
-        DummyExample(input="input1", expected_output="expected_output1", data="data1"),
+        DummyExample(
+            input=EXAMPLE_1_INPUT, expected_output="expected_output1", data="data1"
+        ),
     ]
 
     dataset_repository = InMemoryDatasetRepository()
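
Note on the hunk above: `DummyTaskCanFail` raises only on the first call that receives `EXAMPLE_1_INPUT` and succeeds on every later call, which is what lets the resumed run finish. The following is a minimal standalone sketch of that fail-once behaviour. `FailOnceTask` is a hypothetical stand-in that drops the `Task` base class and the `task_span` argument so it runs without `intelligence_layer`; it is not part of the commit.

```python
EXAMPLE_1_INPUT = "input1"


class FailOnceTask:
    """Hypothetical stand-in for DummyTaskCanFail (no Task base class, no TaskSpan)."""

    def __init__(self) -> None:
        self._raise_exception = True

    def do_run(self, input: str) -> str:
        # Fail only on the first call that sees EXAMPLE_1_INPUT, then behave normally.
        if input == EXAMPLE_1_INPUT and self._raise_exception:
            self._raise_exception = False
            raise Exception("Some random failure in the system.")
        return f"{input} -> output"


task = FailOnceTask()
first_output = task.do_run("input0")  # unaffected input, succeeds

try:
    task.do_run(EXAMPLE_1_INPUT)  # first attempt raises
    failed_once = False
except Exception:
    failed_once = True

retry_output = task.do_run(EXAMPLE_1_INPUT)  # second attempt succeeds
```

Because the flag lives on the task instance, the retry only succeeds when the same instance is reused, as the notebook does with its single `task` object.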

diff --git a/src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb b/src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb
@@ -0,0 +1,111 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pytest\n",
+    "from example_data import DummyTaskCanFail, example_data\n",
+    "\n",
+    "from intelligence_layer.evaluation.run.in_memory_run_repository import (\n",
+    "    InMemoryRunRepository,\n",
+    ")\n",
+    "from intelligence_layer.evaluation.run.runner import Runner\n",
+    "\n",
+    "my_example_data = example_data()\n",
+    "\n",
+    "dataset_repository = my_example_data.dataset_repository\n",
+    "run_repository = InMemoryRunRepository()\n",
+    "task = DummyTaskCanFail()\n",
+    "\n",
+    "runner = Runner(task, dataset_repository, run_repository, \"MyRunDescription\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# How to resume a run after a crash\n",
+    "\n",
+    "0. Run task on a dataset, see [here](./how_to_run_a_task_on_a_dataset.ipynb).\n",
+    "1. A crash occurs.\n",
+    "2. Re-run task on the same dataset with `resume_from_recovery_data` set to `True`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Steps 0 & 1: Run task for dataset\n",
+    "with pytest.raises(Exception):  # noqa: B017\n",
+    "    run_overview = runner.run_dataset(my_example_data.dataset.id, abort_on_error=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A failure has occurred. In practice, this could be a crash of the computer or an unexpected, uncaught exception.\n",
+    "\n",
+    "For demonstration purposes, we set `abort_on_error=True` so that the exception is raised. We also catch the exception with `pytest.raises`, purely so that our CI can execute this notebook. Feel free to remove the `pytest.raises` block when running this notebook locally.\n",
+    "\n",
+    "Even though the run crashed, the `RunRepository` has stored recovery data, so `run_dataset` can continue where it stopped when `resume_from_recovery_data` is set to `True`. Outputs that were already calculated successfully are not recalculated; only the missing examples are processed:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Step 2: Re-run the same run with `resume_from_recovery_data` enabled\n",
+    "run_overview = runner.run_dataset(\n",
+    "    my_example_data.dataset.id, abort_on_error=True, resume_from_recovery_data=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(run_overview)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note: The `FileRunRepository` persists the recovery data in the file system. The run can therefore be resumed even after a crash of the whole program or computer.\n",
+    "\n",
+    "The `InMemoryRunRepository`, on the other hand, retains the recovery data only as long as the repository resides in memory. A crash of the process leads to the loss of the recovery data, in which case all examples have to be recalculated."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "intelligence-layer-dgcJwC7l-py3.11",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
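
Note on the notebook above: the resume mechanism it describes (store each output as soon as it is computed, then skip the already finished examples on the next attempt) can be sketched without the library. The code below is a toy illustration under stated assumptions, not the `Runner` or `FileRunRepository` implementation. `run_with_recovery`, `flaky_task`, and the JSON recovery file are hypothetical names chosen for this sketch; only the parameter name `resume_from_recovery_data` is borrowed from the notebook.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_with_recovery(
    examples: dict[str, str],
    task: Callable[[str], str],
    recovery_file: Path,
    resume_from_recovery_data: bool = False,
) -> dict[str, str]:
    """Run `task` on every example and persist each output right after computing it."""
    outputs: dict[str, str] = {}
    if resume_from_recovery_data and recovery_file.exists():
        # Load the outputs that an earlier, interrupted attempt already produced.
        outputs = json.loads(recovery_file.read_text())
    for example_id, example_input in examples.items():
        if example_id in outputs:
            continue  # finished before the failure, so skip it
        outputs[example_id] = task(example_input)
        recovery_file.write_text(json.dumps(outputs))
    return outputs


calls: list[str] = []
state = {"should_fail": True}


def flaky_task(task_input: str) -> str:
    """Hypothetical task that fails once on "input1", like DummyTaskCanFail."""
    calls.append(task_input)
    if task_input == "input1" and state["should_fail"]:
        state["should_fail"] = False
        raise RuntimeError("Some random failure in the system.")
    return f"{task_input} -> output"


examples = {"id0": "input0", "id1": "input1"}
with tempfile.TemporaryDirectory() as tmp_dir:
    recovery_file = Path(tmp_dir) / "recovery.json"
    try:
        run_with_recovery(examples, flaky_task, recovery_file)
    except RuntimeError:
        pass  # simulated failure after "input0" was already processed
    outputs = run_with_recovery(
        examples, flaky_task, recovery_file, resume_from_recovery_data=True
    )
```

After the resumed attempt, `calls` contains "input0" once and "input1" twice: the finished example is not recomputed. Since the recovery data lives in a file here, it would also survive a crash of the process, which is the property the notebook attributes to the file-based repository; keeping `outputs` only in a Python object would mirror the in-memory case instead.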