The main focus of the Intelligence Layer is to enable developers to
- implement their LLM use cases by building upon and composing existing functionalities
- obtain insights into the runtime behavior of their implementations
- iteratively improve their implementations or compare them to existing implementations by evaluating them against a given set of examples
How these focus points are realized in the Intelligence Layer is described in more detail in the following sections.
At the heart of the Intelligence Layer is a `Task`. A task is a fairly generic concept that simply transforms an input parameter to an output, like a function in mathematics.

`Task: Input -> Output`

In Python this is realized by an abstract class with type parameters and the abstract method `do_run`, in which the actual transformation is implemented:
```python
class Task(ABC, Generic[Input, Output]):
    @abstractmethod
    def do_run(self, input: Input, task_span: TaskSpan) -> Output:
        ...
```
`Input` and `Output` are normal Python datatypes that can be serialized from and to JSON. For this, the Intelligence Layer relies on Pydantic. The types that may be used are defined in the form of the type alias `PydanticSerializable`.

The second parameter, `task_span`, is used for tracing, which is described below.
`do_run` is the method that implements a concrete task and has to be provided by the user. It is executed by the task's external interface method `run`:
```python
class Task(ABC, Generic[Input, Output]):
    @final
    def run(self, input: Input, tracer: Tracer) -> Output:
        ...
```
The signatures of the `do_run` and `run` methods differ only in their tracing parameter.
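To make the pattern concrete, here is a minimal sketch of a custom task. The `GreetingInput`, `GreetingOutput`, and `GreetingTask` names are invented for illustration, and the import path `intelligence_layer.core` is an assumption about the package layout; only the `Task` base class and the `do_run` hook follow the interface described above.

```python
from pydantic import BaseModel

from intelligence_layer.core import Task, TaskSpan  # assumed import path


class GreetingInput(BaseModel):
    # Input and Output are plain Pydantic models, so they serialize to and from JSON.
    name: str


class GreetingOutput(BaseModel):
    greeting: str


class GreetingTask(Task[GreetingInput, GreetingOutput]):
    """A toy task without an LLM: it simply transforms its input into an output."""

    def do_run(self, input: GreetingInput, task_span: TaskSpan) -> GreetingOutput:
        return GreetingOutput(greeting=f"Hello, {input.name}!")
```

Callers never invoke `do_run` directly; they call `run` with a `Tracer`, as shown in the tracing section below.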
Even though the concept is generic, the main purpose of a task is of course to make use of an LLM for the transformation. Tasks are defined at different levels of abstraction: there are higher-level tasks (also called Use Cases) that reflect a typical user problem, and there are lower-level tasks that interface with an LLM on a more generic or even technical level.
Examples of higher-level tasks (Use Cases) are:
- Answering a question based on a given document: `QA: (Document, Question) -> Answer`
- Generating a summary of a given document: `Summary: Document -> Summary`

Examples of lower-level tasks are:
- Letting the model generate text based on an instruction and some context: `Instruct: (Context, Instruction) -> Completion`
- Chunking a text into smaller pieces at optimized boundaries (typically to make it fit into an LLM's context size): `Chunk: Text -> [Chunk]`
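Expressed in code, each of these signatures is just a particular choice of the `Task` type parameters. The classes below are hypothetical stand-ins, not the library's actual use-case types:

```python
class QaInput(BaseModel):
    document: str
    question: str


class QaOutput(BaseModel):
    answer: str


class Qa(Task[QaInput, QaOutput]):
    """Hypothetical higher-level task: QA: (Document, Question) -> Answer."""

    def do_run(self, input: QaInput, task_span: TaskSpan) -> QaOutput:
        ...  # would prompt an LLM with the document and the question
```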
Typically, you would build higher-level tasks from lower-level tasks. Given a task, you can draw a dependency graph that illustrates which sub-tasks it uses and, in turn, which sub-tasks those use. This graph typically forms a hierarchy or, more generally, a directed acyclic graph. The following drawing shows this graph for the Intelligence Layer's `RecursiveSummarize` task:
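Continuing the toy example from above, such a composition is just a task that holds its sub-tasks and calls their `run` methods from its own `do_run`. The `TaskSpan` it receives can be passed straight to the sub-tasks, because a `TaskSpan` is itself a `Tracer` (see the tracing section below):

```python
class AnnouncementInput(BaseModel):
    names: list[str]


class AnnouncementOutput(BaseModel):
    lines: list[str]


class AnnouncementTask(Task[AnnouncementInput, AnnouncementOutput]):
    """A higher-level task composed from a lower-level sub-task."""

    def __init__(self) -> None:
        self._greeting = GreetingTask()

    def do_run(self, input: AnnouncementInput, task_span: TaskSpan) -> AnnouncementOutput:
        # A TaskSpan is a Tracer, so it can be handed to a sub-task's run() as-is.
        greetings = [
            self._greeting.run(GreetingInput(name=name), task_span)
            for name in input.names
        ]
        return AnnouncementOutput(lines=[g.greeting for g in greetings])
```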
A task implements a workflow: it processes its input, passes it on to sub-tasks, processes the outputs of the sub-tasks, and builds its own output. This workflow can be represented in a trace. For this, a task's `run` method takes a `Tracer` that takes care of storing details on the steps of this workflow, such as the tasks that have been invoked along with their inputs, outputs, and timing information. The following illustration shows the trace of a `MultiChunkQa` task:
To represent this, tracing defines the following concepts:
- A `Tracer` is passed to a task's `run` method and provides methods for opening `Span`s or `TaskSpan`s.
- A `Span` is a `Tracer` and allows grouping multiple logs and runtime durations together as a single, logical step in the workflow.
- A `TaskSpan` is a `Span` that allows grouping multiple logs together with the task's specific input and output. An opened `TaskSpan` is passed to `Task.do_run`. Since a `TaskSpan` is a `Tracer`, a `do_run` implementation can pass this instance on to the `run` methods of sub-tasks.
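As a small illustration of these concepts, a `do_run` implementation can open its own `Span`s and attach log entries to them. The `span(...)` context manager and the `log(message, value)` call used below are assumptions about the tracing interface and may differ in detail from the actual API:

```python
class CountWordsInput(BaseModel):
    text: str


class CountWordsOutput(BaseModel):
    word_count: int


class CountWordsTask(Task[CountWordsInput, CountWordsOutput]):
    def do_run(self, input: CountWordsInput, task_span: TaskSpan) -> CountWordsOutput:
        # Group an intermediate step into its own Span and attach a log entry to it.
        with task_span.span("split text") as span:
            words = input.text.split()
            span.log("number of words", len(words))
        return CountWordsOutput(word_count=len(words))
```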
The following diagram illustrates their relationship:
Each of these concepts is implemented as an abstract base class, and the Intelligence Layer provides several concrete implementations that store the actual traces in different backends. For each backend, each of the three abstract classes `Tracer`, `Span`, and `TaskSpan` needs to be implemented. Here, only the top-level `Tracer` implementations are listed:
- The `NoOpTracer` can be used when tracing information should not be stored at all.
- The `InMemoryTracer` stores all traces in an in-memory data structure and is most helpful in tests or Jupyter notebooks.
- The `FileTracer` stores all traces in a JSON file.
- The `OpenTelemetryTracer` uses an OpenTelemetry `Tracer` to store the traces in an OpenTelemetry backend.
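For example, the composite task sketched earlier can be run with an `InMemoryTracer` (again assuming the `intelligence_layer.core` import path):

```python
from intelligence_layer.core import InMemoryTracer

tracer = InMemoryTracer()
output = AnnouncementTask().run(AnnouncementInput(names=["Ada", "Grace"]), tracer)
# The tracer now holds a TaskSpan per task invocation, including inputs, outputs,
# and timings; in a Jupyter notebook it can simply be displayed for inspection.
```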
An important part of the Intelligence Layer is tooling that helps to evaluate custom tasks. Evaluation measures how well the implementation of a task performs given real-world examples. The outcome of an entire evaluation process is an aggregated evaluation result that consists of metrics aggregated over all examples.
The evaluation process helps to:
- optimize a task's implementation by comparing variants and verifying whether changes improve its performance.
- compare the performance of one implementation of a task with that of other (already existing) implementations.
- compare the performance of models for a given task implementation.
- verify how changes to the environment (new model version, new finetuning version) affect the performance of a task.
The basis of an evaluation is a set of examples for the specific task type to be evaluated. A single `Example` consists of:
- an instance of the `Input` for the specific task, and
- optionally, an expected output that can be anything that makes sense in the context of the specific evaluation (e.g. for classification this could be the correct classification result, for QA a golden answer; if an evaluation is only about comparing results with the results of other runs, it can also be empty).
To enable reproducibility of evaluations, datasets are immutable. A single dataset can be used to evaluate all tasks of the same type, i.e. with the same `Input` and `Output` types.
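A minimal sketch of building such a dataset with an in-memory repository, reusing the toy greeting task from above. The import path `intelligence_layer.evaluation` and the exact `create_dataset` signature are assumptions and may differ between library versions:

```python
from intelligence_layer.evaluation import Example, InMemoryDatasetRepository

dataset_repository = InMemoryDatasetRepository()
dataset = dataset_repository.create_dataset(
    examples=[
        Example(input=GreetingInput(name="Ada"), expected_output="Hello, Ada!"),
        Example(input=GreetingInput(name="Grace"), expected_output="Hello, Grace!"),
    ],
    dataset_name="greetings",
)
```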
The Intelligence Layer supports different kinds of evaluation techniques. The most important ones are:
- Computing absolute metrics for a task, where the aggregated result can be compared with the results of previous runs in a way that they can be ordered. Text classification is a typical use case for this: the aggregated result could contain metrics like accuracy, which can easily be compared with other aggregated results.
- Comparing the individual outputs of different runs (all based on the same dataset) in a single evaluation process and producing a ranking of all runs as an aggregated result. This technique is useful when it is hard to come up with an absolute metric to evaluate a single output, but it is easier to compare two different outputs and decide which one is better. An example use case is summarization.
To support these techniques, the Intelligence Layer distinguishes three consecutive steps:
1. Run a task by feeding it all inputs of a dataset and collecting all outputs.
2. Evaluate the outputs of one or several runs and produce an evaluation result for each example. Typically, a single run is evaluated if absolute metrics can be computed, and several runs are evaluated when the outputs of runs are to be compared.
3. Aggregate the evaluation results of one or several evaluation runs into a single object containing the aggregated metrics. Aggregating over several evaluation runs supports amending a previous comparison result with comparisons of new runs without re-executing the previous comparisons.
The following table shows how these three steps are represented in code:
| Step | Executor | Custom Logic | Repository |
|---|---|---|---|
| 1. Run | `Runner` | `Task` | `RunRepository` |
| 2. Evaluate | `Evaluator` | `EvaluationLogic` | `EvaluationRepository` |
| 3. Aggregate | `Aggregator` | `AggregationLogic` | `AggregationRepository` |
Columns explained:
- "Executor" lists concrete implementations provided by the Intelligence Layer.
- "Custom Logic" lists abstract classes that need to be implemented with the custom logic.
- "Repository" lists abstract classes for storing intermediate results. The Intelligence Layer provides different implementations for these. See the next section for details.
During an evaluation process, a lot of intermediate data is created before the final aggregated result can be produced. To avoid repeating expensive computations when new results are to be produced based on previous ones, all intermediate results are persisted. For this, the different executor classes make use of repositories.

There are the following repositories:
- The `DatasetRepository` offers methods to manage datasets. The `Runner` uses it to read all `Example`s of a dataset and feeds them to the `Task`.
- The `RunRepository` is responsible for storing a task's output (in form of an `ExampleOutput`) for each `Example` of a dataset; these are created when a `Runner` runs a task using this dataset. At the end of a run, a `RunOverview` is stored containing some metadata concerning the run. The `Evaluator` reads these outputs, given a list of runs it should evaluate, to create an evaluation result for each `Example` of the dataset.
- The `EvaluationRepository` enables the `Evaluator` to store the evaluation result (in form of an `ExampleEvaluation`) for each example, along with an `EvaluationOverview`. The `Aggregator` uses this repository to read the evaluation results.
- The `AggregationRepository` stores the `AggregationOverview` containing the aggregated metrics on request of the `Aggregator`.
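Together with the `InMemoryDatasetRepository` used earlier, the remaining repositories can be wired up the same way. The class names below follow the library's `InMemory...` naming and are assumed here; file- or database-backed implementations can be substituted without changing the rest of the code:

```python
from intelligence_layer.evaluation import (
    InMemoryAggregationRepository,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
)

run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()
```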
The following diagrams illustrate how the different concepts play together for the different types of evaluations.
Process of an absolute evaluation:

- The `Runner` reads the `Example`s of a dataset from the `DatasetRepository` and runs a `Task` for each `Example.input` to produce `Output`s.
- Each `Output` is wrapped in an `ExampleOutput` and stored in the `RunRepository`.
- The `Evaluator` reads the `ExampleOutput`s for a given run from the `RunRepository` and the corresponding `Example` from the `DatasetRepository` and uses the `EvaluationLogic` to compute an `Evaluation`.
- Each `Evaluation` gets wrapped in an `ExampleEvaluation` and stored in the `EvaluationRepository`.
- The `Aggregator` reads all `ExampleEvaluation`s for a given evaluation and feeds them to the `AggregationLogic` to produce an `AggregatedEvaluation`.
- The `AggregatedEvaluation` is wrapped in an `AggregationOverview` and stored in the `AggregationRepository`.
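Put together, an absolute evaluation of the toy greeting task could look roughly as follows. The `Runner`, `Evaluator`, and `Aggregator` constructor arguments, method names, and overview fields shown here are assumptions derived from the step descriptions above and may differ in detail from the actual API:

```python
from intelligence_layer.evaluation import Aggregator, Evaluator, Runner

# 1. Run: feed every Example of the dataset to the task and store the Outputs.
runner = Runner(GreetingTask(), dataset_repository, run_repository, "greeting-run")
run_overview = runner.run_dataset(dataset.id)

# 2. Evaluate: compare each Output with its Example and store the Evaluations.
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "greeting-eval",
    GreetingEvaluationLogic(),
)
evaluation_overview = evaluator.evaluate_runs(run_overview.id)

# 3. Aggregate: fold all Evaluations into a single AggregatedEvaluation.
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "greeting-aggregation",
    GreetingAggregationLogic(),
)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)
# The aggregated metrics (here: accuracy) are available on the stored overview.
```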
The next diagram illustrates the more complex case of a relative evaluation.
Process of a relative evaluation:

- Multiple `Runner`s read the same dataset and produce the corresponding `Output`s for different `Task`s.
- For each run, all `Output`s are stored in the `RunRepository`.
- The `Evaluator` gets as input previous evaluations (that were produced on the basis of the same dataset, but by different `Task`s) and the new runs of the current task.
- Given the previous evaluations and the new runs, the `Evaluator` can read the `ExampleOutput`s of both the new runs and the runs associated with the previous evaluations, collect all that belong to a single `Example`, and pass them along with the `Example` to the `EvaluationLogic` to compute an `Evaluation`.
- Each `Evaluation` gets wrapped in an `ExampleEvaluation` and is stored in the `EvaluationRepository`.
- The `Aggregator` reads all `ExampleEvaluation`s from all involved evaluations and feeds them to the `AggregationLogic` to produce an `AggregatedEvaluation`.
- The `AggregatedEvaluation` is wrapped in an `AggregationOverview` and stored in the `AggregationRepository`.
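A comparative evaluation reuses the same executors; the main difference is that the `Evaluator` is given several runs at once, so the evaluation logic sees all outputs that belong to one `Example`. The sketch below assumes that `EvaluationLogic.do_evaluate` receives the `Example` plus one wrapped output per run, each carrying a `run_id` and the task's `output`; treat these names as assumptions:

```python
from intelligence_layer.evaluation import EvaluationLogic


class PreferenceEvaluation(BaseModel):
    # Hypothetical comparison result: which run produced the longer greeting.
    preferred_run_id: str


class LongestGreetingWinsLogic(
    EvaluationLogic[GreetingInput, GreetingOutput, str, PreferenceEvaluation]
):
    def do_evaluate(self, example, *outputs) -> PreferenceEvaluation:
        # Each element wraps the run id and the task's Output for this Example.
        best = max(outputs, key=lambda o: len(o.output.greeting))
        return PreferenceEvaluation(preferred_run_id=best.run_id)


# An Evaluator constructed with this logic can then be given several run ids at once:
# evaluation_overview = evaluator.evaluate_runs(run_overview_a.id, run_overview_b.id)
```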