- Introduce `Benchmark` and `StudioBenchmark`
  - `Benchmark` allows you to evaluate and compare the performance of different `Task`s with a fixed evaluation logic, aggregation logic and `Dataset`.
  - Add `how_to_execute_a_benchmark.ipynb` to how-tos
  - Add `studio.ipynb` to notebooks to show how one can debug a `Task` with Studio
- Introduce `BenchmarkRepository` and `StudioBenchmarkRepository`
- Add `create_project` bool to `StudioClient.__init__()` to enable users to automatically create their Studio projects
- Add progressbar to the `Runner` to be able to track the `Run`
- Add `StudioClient.submit_benchmark_lineages` function and include it in `StudioClient.submit_benchmark_execution`
- Add method `DocumentIndexClient.chunks()` for retrieving all text chunks of a document.
- Add metadata filter `FilterOps.IS_NULL`, which allows filtering fields based on whether their value is null.
- The Document Index `SearchQuery` now correctly allows searches with a negative `min_score`.
...
- The env variable `POSTGRES_HOST` is split into `POSTGRES_HOST` and `POSTGRES_PORT`. This affects all classes interacting with Studio and the `InstructionFinetuningDataRepository`.
- The following env variables now need to be set (previously pointed to defaults)
  - `CLIENT_URL` - URL of your inference stack
  - `DOCUMENT_INDEX_URL` - URL of the document index
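
Because these variables no longer fall back to defaults, it can help to check them once at startup. The following is a minimal sketch, not part of the library: `load_connection_settings` is a hypothetical helper, and treating all four variables as required is an assumption made for illustration.

```python
import os
from typing import Mapping

# Variable names taken from the entries above; requiring all four is an assumption.
REQUIRED_VARIABLES = ("CLIENT_URL", "DOCUMENT_INDEX_URL", "POSTGRES_HOST", "POSTGRES_PORT")


def load_connection_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, failing early if any of them is unset or empty."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARIABLES}
```

Calling the helper with a plain dictionary instead of `os.environ` makes it easy to test a configuration before deploying it.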
- You can now customise the embedding model when creating an index using the `DocumentIndexClient`.
- You can now use the `InstructableEmbed` embedding strategy when creating an index using the `DocumentIndexClient`. See the `document_index.ipynb` notebook for more information and an example.
- The way you configure indexes in the `DocumentIndexClient` has changed. See the `document_index.ipynb` notebook for more information.
  - The `EmbeddingType` alias has been renamed to `Representation` to better align with the underlying API.
  - The `embedding_type` field has been removed from the `IndexConfiguration` class. You now configure embedding-related parameters via the `embedding` field.
  - You now always need to specify an embedding model when creating an index. Previously, this was always `luminous-base`.
- Dependency updates
- Add support for Llama3InstructModel in PromptBasedClassify
- Add TextControl to 'to_instruct_prompt' for instruct models
- Add 'attention_manipulation_with_text_controls.ipynb' to tutorial notebooks
- Introduced `InstructionFinetuningDataHandler` to provide methods for storing, retrieving and updating finetuning data samples given an `InstructionFinetuningDataRepository`. Also has methods for filtered sample retrieval and for dataset formatting.
- Introduced `InstructionFinetuningDataRepository` for storing and retrieving finetuning samples. Comes in two implementations:
  - `PostgresInstructionFinetuningDataRepository` to work with data stored in a Postgres database.
  - `FileInstructionFinetuningDataRepository` to work with data stored in the local file-system.
- Compute precision, recall and f1-score by class in `SingleLabelClassifyAggregationLogic`
- Add submit_dataset function to StudioClient
  - Add `how_to_upload_existing_datasets_to_studio.ipynb` to how-tos
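
For readers unfamiliar with per-class metrics, the following sketch shows how precision, recall and f1-score can be derived for each label from pairs of expected and predicted labels. It is an illustration of the standard definitions only; `per_class_scores` is a hypothetical function and not the code used in `SingleLabelClassifyAggregationLogic`.

```python
from collections import Counter
from typing import Iterable


def per_class_scores(pairs: Iterable[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Compute precision, recall and F1 per label from (expected, predicted) pairs."""
    true_positives: Counter[str] = Counter()
    false_positives: Counter[str] = Counter()
    false_negatives: Counter[str] = Counter()
    labels: set[str] = set()
    for expected, predicted in pairs:
        labels.update((expected, predicted))
        if expected == predicted:
            true_positives[expected] += 1
        else:
            # A wrong prediction counts against both the predicted and the expected label.
            false_positives[predicted] += 1
            false_negatives[expected] += 1

    scores: dict[str, dict[str, float]] = {}
    for label in sorted(labels):
        tp, fp, fn = true_positives[label], false_positives[label], false_negatives[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Labels without any predictions or expectations receive a score of 0.0 here; other conventions (for example, leaving the value undefined) are equally valid.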
- Improved some docstring inconsistencies across the codebase and switched the docstring checker to pydoclint.
- Add support for stages and files in Data client.
- Add more in-depth description for `MultipleChunkRetrieverQaOutput` and `ExpandChunks`
- Data repository media types are now validated with a function instead of an Enum.
- Update names of `pharia-1` models to lowercase, aligning with fresh deployments of the api-scheduler.
- Add Catalan and Polish support to `DetectLanguage`.
- Add utility function `run_is_already_computed` to `Runner` to check if a run with the given metadata has already been computed.
  - The `parameter_optimization` notebook describes how to use the `run_is_already_computed` function.
- The default `max_retry_time` for the `LimitedConcurrencyClient` has been reduced from one day to 3 minutes. If you have long-running evaluations that need the longer window, you can set a longer retry time in the constructor.
- You can now specify a `hybrid_index` when creating an index for the document index to use hybrid (semantic and keyword) search.
- `min_score` and `max_results` are now optional parameters in `DocumentIndexClient.SearchQuery`.
- `k` is now an optional parameter in `DocumentIndexRetriever`.
- List all indexes of a namespace with `DocumentIndexClient.list_indexes`.
- Remove an index from a namespace with `DocumentIndexClient.delete_index`.
- `ChatModel` now inherits from `ControlModel`. Although we recommend using the new chat interface, you can now use the `Pharia1ChatModel` with tasks that rely on `ControlModel`.
- `DocumentIndexClient` now properly sets `chunk_overlap` when creating an index configuration.
- The default model for `Llama3InstructModel` is now `llama-3.1-8b-instruct` instead of `llama-3-8b-instruct`. We also removed the llama3.0 models from the recommended models of the `Llama3InstructModel`.
- The default value of `threshold` in the `DocumentIndexRetriever` has changed from `0.5` to `0.0`. This accommodates fusion scoring for searches over hybrid indexes.
- Remove cap for `max_concurrency` in `LimitedConcurrencyClient`.
- Introduce abstract `LanguageModel` class to integrate with LLMs from any API
  - Every `LanguageModel` supports echo to retrieve log probs for an expected completion given a prompt
- Introduce abstract `ChatModel` class to integrate with chat models from any API
  - Introducing `Pharia1ChatModel` for usage with pharia-1 models.
  - Introducing `Llama3ChatModel` for usage with llama models.
- Upgrade `ArgillaWrapperClient` to use Argilla v2.x
- (Beta) Add `DataClient` and `StudioDatasetRepository` as connectors to Studio for submitting data.
- Add the optional argument `generate_highlights` to `MultiChunkQa`, `RetrieverBasedQa` and `SingleChunkQa`. This makes it possible to disable highlighting for performance reasons.
- Increase number of returned `log_probs` in `EloQaEvaluationLogic` to avoid missing a valid answer
- Removed `DefaultArgillaClient`
- Deprecated `Llama2InstructModel`
- We needed to upgrade argilla-server image version from `argilla-server:v1.26.0` to `argilla-server:v1.29.0` to maintain compatibility.
  - Note: We also updated our elasticsearch argilla backend to `8.12.2`
- Updated `DocumentIndexClient` with support for metadata filters.
  - Add documentation for filtering to `document_index.ipynb`.
- Add `StudioClient` as a connector for submitting traces.
- You can now specify a `chunk_overlap` when creating an index in the Document Index.
- Add support for monitoring progress in the document index connector when embedding documents.
- TaskSpan now properly sets its status to `Error` on crash.
- Deprecate old Trace Viewer as the new `StudioClient` replaces it. This affects `Tracer.submit_to_trace_viewer`.
- Update docstrings for 'calculate_bleu' in 'BleuGrader' to now correctly reflect float range from 0 to 100 for the return value.
- Reverted a bug introduced in `MultipleChunkRetrieverQa` text highlighting.
- Serialization and deserialization of `ExportedSpan` and its `attributes` now works as expected.
- `PromptTemplate.to_rich_prompt` now always returns an empty list for prompt ranges that are empty.
- `SingleChunkQa` no longer crashes if given an empty input and a specific prompt template. This did not affect users who used models provided in `core`.
- Added default values for `labels` and `metadata` for `EvaluationOverview` and `RunOverview`
- In the `MultipleChunkRetrieverQa`, text-highlight start and end points are now restricted to within the text length of the respective chunk.
- `RunRepository.example_output` now returns `None` and prints a warning when there is no associated record for the given `run_id` instead of raising a `ValueError`.
- `RunRepository.example_outputs` now returns an empty list and prints a warning when there is no associated record for the given `run_id` instead of raising a `ValueError`.
- `Runner.run_dataset` can now be resumed after failure by setting the `resume_from_recovery_data` flag to `True` and calling `Runner.run_dataset` again.
  - For `InMemoryRunRepository` based `Runner`s this is limited to runs that failed with an exception that did not crash the whole process/kernel.
  - For `FileRunRepository` based `Runner`s even runs that crashed the whole process can be resumed.
- `DatasetRepository.examples` now accepts an optional parameter `examples_to_skip` to enable skipping of `Example`s with the provided IDs.
- Add `how_to_resume_a_run_after_a_crash` notebook.
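
The idea behind resuming a run can be sketched independently of the library: record each finished example as soon as it completes, and skip the recorded IDs on the next attempt. The function below is a hypothetical illustration of that pattern under these assumptions; it is not how `Runner.run_dataset` is implemented, and the in-memory dictionary stands in for whatever recovery data a repository would persist.

```python
from typing import Callable, Iterable


def run_with_recovery(
    example_ids: Iterable[str],
    task: Callable[[str], str],
    recovery_data: dict[str, str],
) -> dict[str, str]:
    """Run `task` for every example that has no entry in `recovery_data` yet."""
    for example_id in example_ids:
        if example_id in recovery_data:
            # Already computed in an earlier attempt, so skip it.
            continue
        # Store the output immediately so that a later failure does not lose it.
        recovery_data[example_id] = task(example_id)
    return recovery_data
```

If the task raises for one example, the outputs collected so far remain in `recovery_data`, and calling the function again with the same dictionary continues from the failed example.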
- Remove unnecessary dependencies from IL
- Added default values for `labels` and `metadata` for `PartialEvaluationOverview`
- Add `eot_token` property to `ControlModel` and derived classes (`LuminousControlModel`, `Llama2InstructModel` and `Llama3InstructModel`) and let `PromptBasedClassify` use this property instead of a hardcoded string.
- Introduce a new argilla client `ArgillaWrapperClient`. This uses the `argilla` package as a connection to argilla and supports all question types that argilla supports in their `FeedbackDataset`. This includes text and yes/no questions. For more information about the questions, check their official documentation.
  - Changes to switch:
    - `DefaultArgillaClient` -> `ArgillaWrapperClient`
    - `Question` -> `argilla.RatingQuestion`, `options` -> `values` and it takes only a list
    - `Field` -> `argilla.TextField`
- Add `description` parameter to `Aggregator.aggregate_evaluation` to allow individual descriptions without the need to create a new `Aggregator`. This was missing from the previous release.
- Add optional field `metadata` to `Dataset`, `RunOverview`, `EvaluationOverview` and `AggregationOverview`
  - Update `parameter_optimization.ipynb` to demonstrate usage of metadata
- Add optional field `label` to `Dataset`, `RunOverview`, `EvaluationOverview` and `AggregationOverview`
- Add `unwrap_metadata` flag to `aggregation_overviews_to_pandas` to enable inclusion of metadata in pandas export. Defaults to True.
- Reinitializing different `AlephAlphaModel` instances and retrieving their tokenizer should now consume a lot less memory.
- Evaluations now raise errors if ids of examples and outputs no longer match. If this happens, continuing the evaluation would only produce incorrect results.
- Performing evaluations on runs with a different number of outputs now raises errors. Continuing the evaluation in this case would only lead to an inconsistent state.
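
The two consistency checks described above can be pictured with a small sketch. It assumes that outputs are identified by the ID of the example they belong to; `check_outputs_match_examples` is a hypothetical helper for illustration and not the validation code used by the evaluators.

```python
def check_outputs_match_examples(
    example_ids: list[str], output_ids_per_run: dict[str, list[str]]
) -> None:
    """Raise a ValueError if any run's outputs do not line up with the dataset's examples."""
    expected = set(example_ids)
    for run_id, output_ids in output_ids_per_run.items():
        if len(output_ids) != len(example_ids):
            raise ValueError(
                f"Run {run_id} has {len(output_ids)} outputs, expected {len(example_ids)}."
            )
        if set(output_ids) != expected:
            raise ValueError(f"Run {run_id} contains outputs for unknown example ids.")
```

Failing before any evaluation logic runs avoids pairing an output with the wrong example.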
- Remove the `Trace` class, as it was no longer used.
- Renamed `example_trace` to `example_tracer` and changed return type to `Optional[Tracer]`.
- Renamed `example_tracer` to `create_tracer_for_example`.
- Replaced langdetect with lingua as language detection tool. This means that old thresholds for detection might need to be adapted.
- `Lineages` now contain `Tracer` for individual `Output`s.
- `convert_to_pandas_data_frame` now also creates a column containing the `Tracer`s.
- `run_dataset` now has a flag `trace_examples_individually` to create `Tracer`s for each example. Defaults to True.
- Added optional `metadata` field to `Example`.
- ControlModels throw a warning instead of an error in case a not-recommended model is selected.
- The `LimitedConcurrencyClient.max_concurrency` is now capped at 10, which is its default, as the underlying `aleph_alpha_client` does not support more currently.
- ExpandChunk now works properly if the chunk of interest is not at the beginning of a very large document. As a consequence, `MultipleChunkRetrieverQa` now works better with larger documents and should return fewer `None` answers.
- We removed the `trace_id` as a concept from various tracing-related functions and moved them to a `context`. If you did not directly use the `trace_id` there is nothing to change.
  - `Task.run` no longer takes a trace id. This was a largely unused feature, and we revamped the trace ids for the traces.
  - Creating `Span`, `TaskSpan` or logs no longer takes `trace_id`. This is handled by the spans themselves, which now have a `context` that identifies them.
    - `Span.id` is therefore also removed. This can be accessed by `span.context.trace_id`, but has a different type.
  - The `OpenTelemetryTracer` no longer logs a custom `trace_id` into the attributes. Use the existing ids from its context instead.
  - Accessing a single trace from a `PersistentTracer.trace()` is no longer supported, as the user does not have access to the `trace_id` anyway. The function is now called `traces` and returns all available traces for a tracer.
- `InMemoryTracer` and derivatives are no longer `pydantic.BaseModel`. Use the `export_for_viewing` function to export a serializable representation of the trace.
- We updated the graders to support python 3.12 and moved away from `nltk`-package:
  - `BleuGrader` now uses `sacrebleu`-package.
  - `RougeGrader` now uses the `rouge_score`-package.
- When using the `ArgillaEvaluator`, attempting to submit to a dataset that already exists will no longer append to the dataset. This makes it more in line with other evaluation concepts.
  - Instead of appending to an active argilla dataset, you now need to create a new dataset, retrieve it and then finally combine both datasets in the aggregation step.
- The `ArgillaClient` now has methods `create_dataset` for less fault-ignoring dataset creation and `add_records` for performant uploads.
- Add support for Python 3.12
- Add `skip_example_on_any_failure` flag to `evaluate_runs` (defaults to True). This lets you configure whether to keep an example for evaluation, even if it failed for some run.
- Add `how_to_implement_incremental_evaluation`.
- Add `export_for_viewing` to tracers to be able to export traces in a unified format similar to OpenTelemetry.
  - This is not supported for the `OpenTelemetryTracer` because of technical incompatibilities.
- All exported spans now contain the status of the span.
- Add `description` parameter to `Evaluator.evaluate_runs` and `Runner.run_dataset` to allow individual descriptions without the need to create a new `Evaluator` or `Runner`.
- All models raise an error during initialization if an incompatible `name` is passed, instead of only when they are used.
- Add `aggregation_overviews_to_pandas` function to allow for easier comparison of multiple aggregation overviews.
- Add `parameter_optimization.ipynb` notebook to demonstrate the optimization of tasks by comparing different parameter combinations.
- Add `convert_file_for_viewing` in the `FileTracer` to convert the trace file format to the new (OpenTelemetry style) format and save as a new file.
- All tracers can now call `submit_to_trace_viewer` to send the trace to the Trace Viewer.
- The document index client now correctly URL-encodes document names in its queries.
- The `ArgillaEvaluator` now properly supports `dataset_name`.
- Update outdated `how_to_human_evaluation_via_argilla.ipynb`.
- Fix bug in `FileSystemBasedRepository` causing spurious mkdir failure if the file actually exists.
- Update broken README links to Read The Docs.
- Fix a broken multi-label classify example in the `evaluation` tutorial.
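
To illustrate why URL-encoding document names matters, here is a small sketch using Python's standard `urllib.parse.quote`. The URL layout and the `document_url` helper are assumptions made for this example and do not necessarily match the routes used by the document index client.

```python
from urllib.parse import quote


def document_url(base_url: str, namespace: str, collection: str, document_name: str) -> str:
    """Build a document URL in which every path segment is percent-encoded."""
    # safe="" also encodes "/", so a name containing slashes stays a single path segment.
    segments = [quote(part, safe="") for part in (namespace, collection, document_name)]
    return f"{base_url}/collections/{segments[0]}/{segments[1]}/docs/{segments[2]}"
```

Without the encoding, a document named `reports/Q1 2024.pdf` would be interpreted as two path segments and contain an unescaped space.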
- Changed the behavior of `IncrementalEvaluator::do_evaluate` such that it now sends all `SuccessfulExampleOutput`s to `do_incremental_evaluate` instead of only the new `SuccessfulExampleOutput`s.
- Add generic `EloEvaluationLogic` class for implementation of Elo evaluation use cases.
- Add `EloQaEvaluationLogic` for Elo evaluation of QA runs, with optional later addition of more runs to an existing evaluation.
- Add `EloAggregationAdapter` class to simplify using the `ComparisonEvaluationAggregationLogic` for different Elo use cases.
- Add `elo_qa_eval` tutorial notebook describing the use of an (incremental) Elo evaluation use case for QA models.
- Add `how_to_implement_elo_evaluations` how-to as skeleton for implementing Elo evaluation cases
- `ExpandChunks`-task is now fast even for very large documents
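
As background for the Elo-related entries above, the sketch below shows the textbook Elo update for a single pairwise comparison. The K-factor of 20 and the function names are assumptions chosen for illustration; the aggregation logic in the library may use different constants or a different procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 20.0
) -> tuple[float, float]:
    """Return the new ratings after one match.

    `score_a` is 1.0 if A won, 0.5 for a draw and 0.0 if B won.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

In a QA setting, each "match" would be one comparison between the answers of two runs for the same example, with the winner decided by a grader.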
We did a major revamp of the `ArgillaEvaluator` to separate an `AsyncEvaluator` from the normal evaluation scenario. This comes with easier to understand interfaces, more information in the `EvaluationOverview` and a simplified aggregation step for Argilla that is no longer dependent on specific Argilla types. Check the how-to for detailed information.
- rename: `AggregatedInstructComparison` to `AggregatedComparison`
- rename: `InstructComparisonArgillaAggregationLogic` to `ComparisonAggregationLogic`
- remove: `ArgillaAggregator` - the regular aggregator now does the job
- remove: `ArgillaEvaluationRepository` - `ArgillaEvaluator` now uses `AsyncRepository` which extends the existing `EvaluationRepository` for the human-feedback use-case
- `ArgillaEvaluationLogic` now uses `to_record` and `from_record` instead of `do_evaluate`. The signature of `to_record` stays the same. The `Field` and `Question` are now defined in the logic instead of passed to the `ArgillaRepository`
- `ArgillaEvaluator` now takes the `ArgillaClient` as well as the `workspace_id`. It inherits from the abstract `AsyncEvaluator` and no longer has `evaluate_runs` and `evaluate`. Instead it has `submit` and `retrieve`.
- `EvaluationOverview` gets attributes `end_date`, `successful_evaluation_count` and `failed_evaluation_count`
  - rename: `start` is now called `start_date` and no longer optional
- We refactored the internals of `Evaluator`. This is only relevant if you subclass from it. Most of the typing and data handling is moved to `EvaluatorBase`
- Add `ComparisonEvaluation` for the elo evaluation to abstract from the Argilla record
- Add `AsyncEvaluator` for human-feedback evaluation. `ArgillaEvaluator` inherits from this.
  - `.submit` pushes all evaluations to Argilla to label them
  - Add `PartialEvaluationOverview` to store the submission details.
  - `.retrieve` then collects all labelled records from Argilla and stores them in an `AsyncRepository`.
  - Add `AsyncEvaluationRepository` to store and retrieve `PartialEvaluationOverview`. Also added `AsyncFileEvaluationRepository` and `AsyncInMemoryEvaluationRepository`
- Add `EvaluatorBase` and `EvaluationLogicBase` for base classes for both async and synchronous evaluation.
- Improve description of using artifactory tokens for installation of IL
- Change `confusion_matrix` in `SingleLabelClassifyAggregationLogic` such that it can be persisted in a file repository
- `AlephAlphaModel` now supports a `context_size`-property
- Add new `IncrementalEvaluator` for easier addition of runs to existing evaluations without repeated evaluation.
  - Add `IncrementalEvaluationLogic` for use in `IncrementalEvaluator`
Initial stable release
Version 1.0.0 introduces some new features but also some breaking changes you should be aware of. In addition to these changes, we had to reset our commit history.
- The TraceViewer has been exported to its own repository and can be accessed via the artifactory here
- `HuggingFaceDatasetRepository` now has a parameter `caching`, which caches examples of a dataset once loaded.
  - `True` as default value
  - set to `False` for non-breaking-change
- Introduction of `Llama2InstructModel` allows support of the Llama2-models:
  - `llama-2-7b-chat`
  - `llama-2-13b-chat`
  - `llama-2-70b-chat`
- Introduction of `Llama3InstructModel` allows support of the Llama3-models:
  - `llama-3-8b-instruct`
  - `llama-3-70b-instruct`
- `DocumentIndexClient` has been enhanced with the following set of features:
  - `create_index`
  - feature `index_configuration`
  - `assign_index_to_collection`
  - `delete_index_from_collection`
  - `list_assigned_index_names`
- `ExpandChunks`-task now caches chunked documents by ID
- `DocumentIndexRetriever` now supports `index_name`
- `Runner.run_dataset` now has a configurable number of workers via `max_workers` and defaults to the previous value, which is 10.
- In case a `BusyError` is raised during a `complete` the `LimitedConcurrencyClient` will retry until `max_retry_time` is reached.
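
The retry behaviour described in the last entry can be sketched as a loop with a deadline. Everything in this example is an assumption for illustration: the `BusyError` class is a local stand-in, `call_with_retry` is a hypothetical helper, and the doubling back-off is one possible waiting strategy rather than the one used by `LimitedConcurrencyClient`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BusyError(Exception):
    """Local stand-in for the error raised when the API is busy."""


def call_with_retry(
    request: Callable[[], T],
    max_retry_time: float,
    initial_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `request` on BusyError until `max_retry_time` seconds would be exceeded."""
    deadline = clock() + max_retry_time
    delay = initial_delay
    while True:
        try:
            return request()
        except BusyError:
            if clock() + delay > deadline:
                # Waiting again would pass the deadline, so give up and re-raise.
                raise
            sleep(delay)
            delay *= 2  # back off to avoid hammering a busy service
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting.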
- `HuggingFaceRepository` no longer is a dataset repository. This also means that `HuggingFaceAggregationRepository` no longer is a dataset repository.
- The input parameter of the `DocumentIndex.search()`-function now has been renamed from `index` to `index_name`