v0.7.0
SebastianNiehusTNG released this on 28 Mar 10:23 · 333 commits to main since this release
Overview
- Refactoring in Evaluation:
  - Many changes to the Evaluation repository structure, including renaming, to make the overall handling more intuitive and consistent
  - New How-Tos and improved documentation
  - Simplified repository access via data selection methods
- Better text highlighting
- Better trace viewer integration:
  - Displaying `InMemoryTracer` objects in a Jupyter notebook will load them into an active trace viewer.
Breaking Changes
- breaking change: `FScores` are now correctly exposed as `FScores` and no longer as `RougeScores`.
- breaking change: `HuggingFaceAggregationRepository` and `HuggingFaceDatasetRepository` now consistently follow the same folder structure as `FileDatasetRepository` when creating datasets. Datasets are stored in a `datasets` folder, with additional sub-folders named after the respective dataset IDs.
- breaking change: Split `run_repository` into `file_run_repository` and `in_memory_run_repository`.
- breaking change: Split `evaluation_repository` into `argilla_evaluation_repository`, `file_evaluation_repository` and `in_memory_evaluation_repository`.
- breaking change: Split `dataset_repository` into `file_dataset_repository` and `in_memory_dataset_repository`.
- breaking change: Split `aggregation_repository` into `file_aggregation_repository` and `in_memory_aggregation_repository`.
- breaking change: Renamed `evaluation/run.py` to `evaluation/run_evaluator.py`.
- breaking change: Split `evaluation/domain` and distributed it across the aggregation, evaluation, dataset and run packages.
- breaking change: Split `evaluation/argilla` and distributed it across the aggregation and evaluation packages.
- breaking change: Split the evaluation package into separate dataset, run, evaluation and aggregation packages.
- breaking change: Split `evaluation/hugging_face.py` into dataset and aggregation repository files in the `data_storage` package.
- breaking change: `create_dataset` now returns the new `Dataset` type instead of a dataset ID.
- breaking change: Consistent naming for repository root directories when creating evaluations or aggregations: `.../eval` → `.../evaluations` and `.../aggregation` → `.../aggregations`.
- breaking change: Core tasks no longer provide defaults for the applied models.
- breaking change: Methods returning entities from repositories now return the results ordered by their IDs.
- breaking change: Renamed `crashed_during_eval_count` to `crashed_during_evaluation_count` in `AggregationOverview`.
- breaking change: Renamed `create_evaluation_dataset` to `initialize_evaluation` in `EvaluationRepository`.
- breaking change: Renamed `to_explanation_response` to `to_explanation_request` in `ExplainInput`.
- breaking change: Removed `TextHighlight::text` in favor of `TextHighlight::start` and `TextHighlight::end`.
- breaking change: Removed `IntelligenceApp` and `IntelligenceStarterApp`.
- breaking change: `RetrieverBasedQa` now uses `MultiChunkQa` instead of the generic task `SingleChunkQa`.
- breaking change: `EvaluationRepository::failed_example_evaluations` is no longer abstract.
- breaking change: Elo calculation simplified:
  - `Payoff` has been removed from the elo package.
  - `PayoffMatrix` from the elo package has been renamed to `MatchOutcome`.
- breaking change: `SingleChunkQa` uses `logit_bias` to promote not answering for German.
- breaking change: Removed the `ChunkOverlap` task.
- breaking change: Renamed `Chunk` to `TextChunk`.
- breaking change: Renamed `ChunkTask` to `Chunk`.
- breaking change: Renamed `EchoTask` to `Echo`.
- breaking change: Renamed `TextHighlightTask` to `TextHighlight`.
- breaking change: Renamed `ChunkOverlapTask` to `ChunkOverlap`.
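The new on-disk dataset layout described above can be illustrated with a small sketch. `dataset_path` is a hypothetical helper, not part of the library; only the `datasets/<dataset-id>` folder structure is taken from these notes:

```python
from pathlib import Path

def dataset_path(root: str, dataset_id: str) -> Path:
    # Hypothetical helper: mirrors the documented layout in which every
    # dataset lives under a "datasets" folder, in a sub-folder named
    # after its dataset ID.
    return Path(root) / "datasets" / dataset_id
```

For example, `dataset_path("my-repo", "abc123")` yields `my-repo/datasets/abc123` on POSIX systems.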
New Features
Aggregation
- feature: `InstructComparisonArgillaAggregationLogic` uses the full evaluation set instead of a sample for aggregation.
Documentation
- feature: Added How-Tos (linked in the README):
  - how to define a task
  - how to implement a task
  - how to create a dataset
  - how to run a task on a dataset
  - how to perform aggregation
  - how to evaluate runs
- feature: Restructured and cleaned up the README for conciseness.
- feature: Added illustrations to Concepts.md.
- feature: Added a tutorial for adding a task to a FastAPI app (linked in the README).
- feature: Improved and added various docstrings.
- feature: Added a README section about the client URL.
- feature: Added the Python naming convention to the README.
Classify
- feature: `PromptBasedClassify` now supports changing the prompt instruction via the `instruction` parameter.
- feature: Added a default model for `PromptBasedClassify`.
- feature: Added a default task for `PromptBasedClassify`.
Evaluation
- feature: All repositories now raise a `ValueError` when an entry of a non-existent dataset is accessed. If only the dataset itself is retrieved, `None` is returned.
- feature: `ArgillaEvaluationRepository` now handles failed evaluations.
- feature: Added `SingleHuggingfaceDatasetRepository`.
- feature: Added `HighlightCoverageGrader`.
- feature: Added `LanguageMatchesGrader`.
- feature: Added prettier default printing behavior for repository entities by providing overloads for the `__str__` and `__repr__` methods.
- feature: Added an abstract `HuggingFace` repository base class.
- feature: Refactored the `HuggingFace` repository.
- feature: Added `HuggingFaceAggregationRepository`.
- feature: Added a template method to the individual repositories.
- feature: Added a `Dataset` model to the dataset repository. This allows storing a short descriptive name for the dataset for easier identification.
- feature: `SingleChunkQa` now internally uses the same model in `TextHighlight` by default.
- feature: `MeanAccumulator` tracks the standard deviation and standard error.
- feature: `EloCalculator` now updates the ranking after each match.
- feature: Added data selection methods to the repositories:
  - `AggregationRepository::aggregation_overviews`
  - `EvaluationRepository::run_overviews`
  - `EvaluationRepository::run_overview_ids`
  - `EvaluationRepository::example_output`
  - `EvaluationRepository::example_outputs`
  - `EvaluationRepository::example_output_ids`
  - `EvaluationRepository::example_trace`
  - `EvaluationRepository::example_tracer`
  - `RunRepository::run_overviews`
  - `RunRepository::run_overview_ids`
  - `RunRepository::example_output`
  - `RunRepository::example_outputs`
  - `RunRepository::example_output_ids`
  - `RunRepository::example_trace`
  - `RunRepository::example_tracer`
- feature: `Evaluator` continues in case of no successful outputs.
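A per-match ranking update like the one `EloCalculator` now performs can be sketched with the standard Elo formula. The K-factor and the 400-point scale below are the conventional textbook choices, assumed for illustration; they are not taken from the library:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    # Standard Elo update for a single match; score_a is 1.0 for a win,
    # 0.5 for a draw and 0.0 for a loss of player A.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    # Applying the delta immediately means the next match already sees
    # the adjusted ratings, i.e. the ranking updates after each match.
    return rating_a + delta, rating_b - delta
```

With equal ratings of 1200 and a win for player A, this yields 1216 and 1184.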
Q & A
- feature: Defined default parameters for `LongContextQa` and `SingleChunkQa`.
- feature: Defined a default task for `RetrieverBasedQa`.
- feature: Defined a default model for `KeyWordExtract` and `MultiChunkQa`.
- feature: Improved the focus of highlights in `TextHighlight` tasks.
- feature: Added filtering for `TextHighlight` tasks.
- feature: Introduced `logit_bias` to `SingleChunkQa`.
Summarize
- feature: Added `RecursiveSummarizeInput`.
- feature: Defined defaults for `SteerableSingleChunkSummarize`, `SteerableLongContextSummarize` and `RecursiveSummarize`.
Tracer
- feature: Better trace viewer integration:
  - Added trace storage to the trace viewer server
  - Added a `submit_to_tracer_viewer` method to `InMemoryTracer`
  - UI and navigation improvements for the trace viewer
  - Added exception handling for tracers during log entry writing
Others
- feature: The following classes are now exposed: `DocumentChunk`, `MultipleChunkQaOutput`, `Subanswer`.
- feature: Simplified internal imports.
- feature: Streamlined the `__init__` parameters of all tasks:
  - Sub-tasks are typically exposed as `__init__` parameters with sensible defaults.
  - Defaults for non-trivial parameters like models or tasks are defined in `__init__`, while the default parameter value is `None`.
  - Instead of exposing parameters that are passed on to sub-tasks, the sub-tasks themselves are exposed.
- feature: Updated the supported models.
Fixes
- fix: Fixed exception handling in the language detection of `LanguageMatchesGrader`.
- fix: Fixed a bug that could lead to cut-off highlight ranges in `TextHighlight` tasks.
- fix: Fixed `list_ids` methods to use `path_to_str`.
- fix: Disallowed traces without an end in the trace viewer.
- fix: `ArgillaClient` now correctly uses the provided API URL instead of a hard-coded localhost.
Full Changelog: v0.6.0...v0.7.0