docs: how_to_resume_a_run_after_a_crash (#919)
Task: PHS-600

Co-authored-by: FelixFehse <[email protected]>
SebastianNiehusAA and FelixFehse authored Jun 19, 2024
1 parent 1237e88 commit ae94ea9
Showing 4 changed files with 131 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@
- For `InMemoryRunRepository`-based `Runner`s, this is limited to runs that failed with an exception that did not crash the whole process/kernel.
- For `FileRunRepository`-based `Runner`s, even runs that crashed the whole process can be resumed.
- `DatasetRepository.examples` now accepts an optional parameter `examples_to_skip` to enable skipping of `Example`s with the provided IDs.
- Add `how_to_resume_a_run_after_a_crash` notebook.

### Fixes
...
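The resume behaviour described in the changelog comes down to skipping every example whose output is already stored. The library-independent sketch below illustrates only that idea; `Example` and `examples_excluding` are hypothetical names, not the `DatasetRepository.examples` API, whose exact signature is not shown in this commit.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Example:
    # Hypothetical stand-in for a dataset example with a unique ID.
    id: str
    input: str


def examples_excluding(
    examples: Iterable[Example],
    examples_to_skip: Optional[frozenset[str]] = None,
) -> Iterator[Example]:
    """Return an iterator over the examples whose IDs are not in `examples_to_skip`."""
    skip = examples_to_skip or frozenset()
    return (example for example in examples if example.id not in skip)


dataset = [Example("a", "input0"), Example("b", "input1"), Example("c", "input2")]
# Assumed recovery data: IDs whose outputs were stored before the crash.
already_done = frozenset({"a"})
remaining = [example.id for example in examples_excluding(dataset, already_done)]
print(remaining)  # ['b', 'c']
```

Passing the set of already finished IDs as the skip set is what lets a resumed run process only the missing examples.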
1 change: 1 addition & 0 deletions README.md
@@ -154,6 +154,7 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
| [...implement a simple evaluation and aggregation logic](./src/documentation/how_tos/how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb) | Basic examples of evaluation and aggregation logic |
| [...create a dataset](./src/documentation/how_tos/how_to_create_a_dataset.ipynb) | Create a dataset used for running a task |
| [...run a task on a dataset](./src/documentation/how_tos/how_to_run_a_task_on_a_dataset.ipynb) | Run a task on a whole dataset instead of single examples |
| [...resume a run after a crash](./src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb) | Resume a run after a crash or exception occurred |
| [...evaluate multiple runs](./src/documentation/how_tos/how_to_evaluate_runs.ipynb) | Evaluate (multiple) runs in a single evaluation |
| [...aggregate multiple evaluations](./src/documentation/how_tos/how_to_aggregate_evaluations.ipynb) | Aggregate (multiple) evaluations in a single aggregation |
| [...retrieve data for analysis](./src/documentation/how_tos/how_to_retrieve_data_for_analysis.ipynb) | Retrieve experiment data in multiple different ways |
19 changes: 18 additions & 1 deletion src/documentation/how_tos/example_data.py
@@ -34,6 +34,21 @@ def do_run(self, input: str, task_span: TaskSpan) -> str:
        return f"{input} -> output"


EXAMPLE_1_INPUT = "input1"


class DummyTaskCanFail(Task[str, str]):
    def __init__(self) -> None:
        super().__init__()
        self._raise_exception = True

    def do_run(self, input: str, task_span: TaskSpan) -> str:
        if input == EXAMPLE_1_INPUT and self._raise_exception:
            self._raise_exception = False
            raise Exception("Some random failure in the system.")
        return f"{input} -> output"


class DummyEvaluation(BaseModel):
    eval: str

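The failure injected by `DummyTaskCanFail` hinges on an instance flag: the first call with `EXAMPLE_1_INPUT` raises, and every later call succeeds. The sketch below reproduces that pattern in plain Python without the `intelligence_layer` `Task` base class; `FailOnceTask` and its `run` method are hypothetical stand-ins for `DummyTaskCanFail.do_run`.

```python
EXAMPLE_1_INPUT = "input1"


class FailOnceTask:
    """Hypothetical plain-Python analogue of DummyTaskCanFail."""

    def __init__(self) -> None:
        self._raise_exception = True

    def run(self, input: str) -> str:
        if input == EXAMPLE_1_INPUT and self._raise_exception:
            # Flip the flag before raising so the next attempt succeeds.
            self._raise_exception = False
            raise Exception("Some random failure in the system.")
        return f"{input} -> output"


task = FailOnceTask()
try:
    task.run(EXAMPLE_1_INPUT)
except Exception as error:
    print(f"first attempt failed: {error}")
print(task.run(EXAMPLE_1_INPUT))  # input1 -> output
```

Because the flag lives on the task instance, reusing the same task object for the resumed run, as the notebook does through its single `Runner`, is what allows the second attempt to succeed.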
@@ -102,7 +117,9 @@ class ExampleData:
def example_data() -> ExampleData:
    examples = [
        DummyExample(input="input0", expected_output="expected_output0", data="data0"),
        DummyExample(input="input1", expected_output="expected_output1", data="data1"),
        DummyExample(
            input=EXAMPLE_1_INPUT, expected_output="expected_output1", data="data1"
        ),
    ]

    dataset_repository = InMemoryDatasetRepository()
111 changes: 111 additions & 0 deletions src/documentation/how_tos/how_to_resume_a_run_after_a_crash.ipynb
@@ -0,0 +1,111 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pytest\n",
"from example_data import DummyTaskCanFail, example_data\n",
"\n",
"from intelligence_layer.evaluation.run.in_memory_run_repository import (\n",
" InMemoryRunRepository,\n",
")\n",
"from intelligence_layer.evaluation.run.runner import Runner\n",
"\n",
"my_example_data = example_data()\n",
"\n",
"dataset_repository = my_example_data.dataset_repository\n",
"run_repository = InMemoryRunRepository()\n",
"task = DummyTaskCanFail()\n",
"\n",
"runner = Runner(task, dataset_repository, run_repository, \"MyRunDescription\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to resume a run after a crash\n",
"\n",
"0. Run a task on a dataset, as described [here](./how_to_run_a_task_on_a_dataset.ipynb).\n",
"1. A crash occurs.\n",
"2. Re-run the task on the same dataset with `resume_from_recovery_data` set to `True`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Steps 0 & 1: Run the task on the dataset; one example fails\n",
"with pytest.raises(Exception):  # noqa: B017\n",
"    run_overview = runner.run_dataset(my_example_data.dataset.id, abort_on_error=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A failure has occurred. In practice, this could be a crash of the computer or an unexpected, uncaught exception.\n",
"\n",
"For demonstration purposes, we set `abort_on_error=True` so that an exception is raised. We catch the exception only so that this notebook can run in our CI; feel free to remove the `pytest.raises` block when running the notebook locally.\n",
"\n",
"Even though the run crashed, the `RunRepository` has stored recovery data, so `run_dataset` can continue where it left off when `resume_from_recovery_data` is set to `True`. Outputs that were already computed successfully are not recalculated; only the missing examples are processed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Step 2: Re-run the same run with `resume_from_recovery_data` enabled\n",
"run_overview = runner.run_dataset(\n",
" my_example_data.dataset.id, abort_on_error=True, resume_from_recovery_data=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(run_overview)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: The `FileRunRepository` persists the recovery data in the file system. The run can therefore be resumed even after a crash of the whole program or of the computer.\n",
"\n",
"In contrast, the `InMemoryRunRepository` retains the recovery data only as long as the repository object exists in memory. If the process crashes, the recovery data is lost and all examples have to be recalculated."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intelligence-layer-dgcJwC7l-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
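To make the difference between file-based and in-memory recovery data concrete, here is a toy simulation of file-persisted recovery. It is not how `intelligence_layer` implements `FileRunRepository`; `ToyFileRunner`, `flaky_task` and the JSON recovery-file layout are assumptions chosen for illustration. Outputs are written to disk after every example, and a freshly constructed runner, standing in for a restarted process, resumes from that file. The "crash" here is only a raised exception.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class ToyFileRunner:
    """Hypothetical runner that keeps finished outputs in a JSON recovery file."""

    def __init__(self, task: Callable[[str], str], recovery_file: Path) -> None:
        self._task = task
        self._recovery_file = recovery_file

    def _load_recovery_data(self) -> dict[str, str]:
        if self._recovery_file.exists():
            return json.loads(self._recovery_file.read_text())
        return {}

    def run_dataset(
        self, inputs: dict[str, str], resume_from_recovery_data: bool = False
    ) -> dict[str, str]:
        outputs = self._load_recovery_data() if resume_from_recovery_data else {}
        for example_id, input_text in inputs.items():
            if example_id in outputs:
                continue  # computed before the crash, so skip it
            # Any exception propagates, i.e. the run aborts on the first error.
            outputs[example_id] = self._task(input_text)
            # Persist after every example so a crash loses at most the current one.
            self._recovery_file.write_text(json.dumps(outputs))
        return outputs


calls: list[str] = []
should_fail = {"input1": True}


def flaky_task(input_text: str) -> str:
    # Fails once on "input1", like DummyTaskCanFail, and logs every call.
    calls.append(input_text)
    if should_fail.get(input_text):
        should_fail[input_text] = False
        raise RuntimeError("simulated crash")
    return f"{input_text} -> output"


inputs = {"0": "input0", "1": "input1"}
with tempfile.TemporaryDirectory() as tmp_dir:
    recovery_file = Path(tmp_dir) / "recovery.json"
    try:
        ToyFileRunner(flaky_task, recovery_file).run_dataset(inputs)
    except RuntimeError:
        pass  # the "crash": only example "0" reached the recovery file
    # A new runner object, standing in for a restarted process, resumes from disk.
    result = ToyFileRunner(flaky_task, recovery_file).run_dataset(
        inputs, resume_from_recovery_data=True
    )

print(result)  # {'0': 'input0 -> output', '1': 'input1 -> output'}
print(calls)  # ['input0', 'input1', 'input1']
```

The call log shows that `input0` is computed only once; only the example that failed is processed again. With a purely in-memory store, a new process would start from an empty dictionary and recompute every example.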
