v0.7.0

SebastianNiehusTNG released this 28 Mar 10:23
· 333 commits to main since this release

Overview

  • Refactoring in Evaluation
    • Restructured and renamed many parts of the Evaluation repository to make the overall handling more intuitive and consistent.
  • New How-To’s and improved documentation
  • Simplified repository access via data selection methods
  • Better text highlighting
  • Better trace viewer integration
    • Displaying InMemoryTracer objects in a Jupyter notebook will load them into an active trace viewer.

Breaking Changes

  • breaking change: FScores are now correctly exposed as FScores and no longer as RougeScores
  • breaking change: HuggingFaceAggregationRepository and HuggingFaceDatasetRepository now consistently follow the same folder structure as FileDatasetRepository when creating datasets. This means that datasets will be stored in a folder named datasets, in sub-folders named according to the respective dataset ID.
  • breaking change: Split run_repository into file_run_repository and in_memory_run_repository.
  • breaking change: Split evaluation_repository into argilla_evaluation_repository, file_evaluation_repository and in_memory_evaluation_repository
  • breaking change: Split dataset_repository into file_dataset_repository and in_memory_dataset_repository
  • breaking change: Split aggregation_repository into file_aggregation_repository and in_memory_aggregation_repository
  • breaking change: Renamed evaluation/run.py to evaluation/run_evaluator.py
  • breaking change: Split evaluation/domain and distribute it across aggregation, evaluation, dataset and run packages.
  • breaking change: Split evaluation/argilla and distribute it across aggregation and evaluation packages.
  • breaking change: Split evaluation into separate dataset, run, evaluation and aggregation packages.
  • breaking change: Split evaluation/hugging_face.py into dataset and aggregation repository files in data_storage package.
  • breaking change: create_dataset now returns the new Dataset type instead of a dataset ID.
  • breaking change: Consistent naming for repository root directories when creating evaluations or aggregations: .../eval → .../evaluations and .../aggregation → .../aggregations.
  • breaking change: Core tasks no longer provide defaults for the applied models.
  • breaking change: Methods returning entities from repositories now return the results ordered by their IDs.
  • breaking change: Renamed crashed_during_eval_count to crashed_during_evaluation_count in AggregationOverview.
  • breaking change: Renamed create_evaluation_dataset to initialize_evaluation in EvaluationRepository.
  • breaking change: Renamed to_explanation_response to to_explanation_request in ExplainInput.
  • breaking change: Removed TextHighlight::text in favor of TextHighlight::start and TextHighlight::end
  • breaking change: Removed IntelligenceApp and IntelligenceStarterApp
  • breaking change: RetrieverBasedQa now uses MultiChunkQa instead of the generic task SingleChunkQa
  • breaking change: EvaluationRepository::failed_example_evaluations is no longer abstract
  • breaking change:
    • Elo calculation simplified: Payoff from elo package has been removed
    • PayoffMatrix from elo package renamed to MatchOutcome
    • SingleChunkQa uses logit_bias to promote not answering for German
  • breaking change: Remove ChunkOverlap task.
  • breaking change: Rename Chunk to TextChunk.
  • breaking change: Rename ChunkTask to Chunk.
  • breaking change: Rename EchoTask to Echo.
  • breaking change: Rename TextHighlightTask to TextHighlight
  • breaking change: Rename ChunkOverlapTask to ChunkOverlap
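
For the removal of TextHighlight::text listed above, code that previously read the text of a highlight can slice the original text with the new offsets instead. The following is a minimal sketch, assuming that start and end are character offsets into the explained text and that end is exclusive. The Highlight dataclass and the highlighted_text helper are hypothetical stand-ins for illustration, not the library's classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Highlight:
    """Hypothetical stand-in for a highlight that carries offsets only."""

    start: int  # assumed: index of the first highlighted character
    end: int  # assumed: index one past the last highlighted character
    score: float


def highlighted_text(source: str, highlight: Highlight) -> str:
    """Recover the highlighted substring from the source text."""
    return source[highlight.start : highlight.end]


if __name__ == "__main__":
    source = "The quick brown fox"
    highlight = Highlight(start=4, end=9, score=0.8)
    print(highlighted_text(source, highlight))
```

Storing offsets rather than the text itself keeps a highlight unambiguous when the same substring occurs more than once in the source.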

New Features

Aggregation:

  • feature: InstructComparisonArgillaAggregationLogic uses full evaluation set instead of sample for aggregation

Documentation

  • feature: Added How-To’s (linked in the README):
    • how to define a task
    • how to implement a task
    • how to create a dataset
    • how to run a task on a dataset
    • how to perform aggregation
    • how to evaluate runs
  • feature: Restructured and cleaned up README for more conciseness.
  • feature: Add illustrations to Concepts.md.
  • feature: Added tutorial for adding task to a FastAPI app (linked in README).
  • feature: Improved and added various DocStrings.
  • feature: Added a README section about the client URL.
  • feature: Add python naming convention to README

Classify

  • feature: PromptBasedClassify now supports changing the prompt instruction via the instruction parameter.
  • feature: Add default model for PromptBasedClassify
  • feature: Add default task for PromptBasedClassify

Evaluation

  • feature: All repositories will raise a ValueError when trying to access an entry of a dataset that does not exist. If only the dataset itself is retrieved, they will return None.
  • feature: ArgillaEvaluationRepository now handles failed evaluations.
  • feature: Added SingleHuggingfaceDatasetRepository.
  • feature: Added HighlightCoverageGrader.
  • feature: Added LanguageMatchesGrader.
  • feature: Added prettier default printing behavior of repository entities by overriding the __str__ and __repr__ methods.
  • feature: Added abstract HuggingFace repository base-class.
  • feature: Refactoring of HuggingFace repository
  • feature: Added HuggingFaceAggregationRepository.
  • feature: Added template method to individual repository
  • feature: Added Dataset model to dataset repository. This allows storing a short descriptive name for the dataset for easier identification.
  • feature: SingleChunkQa now internally uses the same model for TextHighlight by default.
  • feature: MeanAccumulator tracks standard deviation and standard error.
  • feature: EloCalculator now updates ranking after each match.
  • feature: Add data selection methods to repositories:
    • AggregationRepository::aggregation_overviews
    • EvaluationRepository::run_overviews
    • EvaluationRepository::run_overview_ids
    • EvaluationRepository::example_output
    • EvaluationRepository::example_outputs
    • EvaluationRepository::example_output_ids
    • EvaluationRepository::example_trace
    • EvaluationRepository::example_tracer
    • RunRepository::run_overviews
    • RunRepository::run_overview_ids
    • RunRepository::example_output
    • RunRepository::example_outputs
    • RunRepository::example_output_ids
    • RunRepository::example_trace
    • RunRepository::example_tracer
  • feature: Evaluator continues in case of no successful outputs
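
To illustrate what tracking standard deviation and standard error alongside a mean can look like, here is a minimal sketch of a running accumulator. It is not the library's MeanAccumulator: the class name RunningMean and its method names are hypothetical, and it assumes the population standard deviation and a standard error computed as the standard deviation divided by the square root of the sample count.

```python
import math


class RunningMean:
    """Hypothetical accumulator; not the library's MeanAccumulator."""

    def __init__(self) -> None:
        self._count = 0
        self._sum = 0.0
        self._sum_of_squares = 0.0

    def add(self, value: float) -> None:
        """Record one observation."""
        self._count += 1
        self._sum += value
        self._sum_of_squares += value * value

    def mean(self) -> float:
        """Return the mean, or 0.0 if nothing was recorded."""
        return self._sum / self._count if self._count else 0.0

    def standard_deviation(self) -> float:
        """Return the population standard deviation (assumption)."""
        if self._count == 0:
            return 0.0
        variance = self._sum_of_squares / self._count - self.mean() ** 2
        # Guard against tiny negative values caused by rounding.
        return math.sqrt(max(variance, 0.0))

    def standard_error(self) -> float:
        """Return the standard deviation divided by sqrt(count)."""
        if self._count == 0:
            return 0.0
        return self.standard_deviation() / math.sqrt(self._count)


if __name__ == "__main__":
    accumulator = RunningMean()
    for score in [2, 4, 4, 4, 5, 5, 7, 9]:
        accumulator.add(score)
    print(accumulator.mean())
    print(accumulator.standard_deviation())
    print(accumulator.standard_error())
```

Keeping only the count, the sum, and the sum of squares means the statistics can be reported at any point without storing individual values.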

Q & A

  • feature: Define default parameters for LongContextQa, SingleChunkQa
  • feature: Define default task for RetrieverBasedQa
  • feature: Define default model for KeyWordExtract, MultiChunkQa
  • feature: Improved focus of highlights in TextHighlight tasks.
  • feature: Added filtering for TextHighlight tasks.
  • feature: Introduce logit_bias to SingleChunkQa
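
The logit_bias entry above refers to the general technique of adding a fixed offset to the scores of chosen tokens before they are turned into probabilities. The sketch below shows that idea in isolation. It is a generic illustration and not how SingleChunkQa is implemented; the function names, the token indices, and the bias value are all hypothetical.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exponentials = [math.exp(logit - highest) for logit in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def apply_logit_bias(logits: List[float], bias: Dict[int, float]) -> List[float]:
    """Add a per-token offset to the logits; unlisted tokens are unchanged."""
    return [logit + bias.get(index, 0.0) for index, logit in enumerate(logits)]


if __name__ == "__main__":
    # Hypothetical vocabulary of three tokens; index 2 stands for a
    # "no answer" token that should be promoted.
    logits = [2.0, 1.0, 0.5]
    plain = softmax(logits)
    biased = softmax(apply_logit_bias(logits, {2: 3.0}))
    print(plain[2], biased[2])
```

A positive bias raises the probability of the chosen token at the expense of all others, which is how a bias can promote a specific response such as declining to answer.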

Summarize

  • feature: Added RecursiveSummarizeInput.
  • feature: Define defaults for SteerableSingleChunkSummarize, SteerableLongContextSummarize, RecursiveSummarize

Tracer

  • feature: Added better trace viewer integration:
    • Added trace storage to trace viewer server
    • Added submit_to_tracer_viewer method to InMemoryTracer
    • UI and navigation improvements for trace viewer
    • Add exception handling for tracers during log entry writing

Others

  • feature: The following classes are now exposed:
    • DocumentChunk
    • MultipleChunkQaOutput
    • Subanswer
  • feature: Simplified internal imports.
  • feature: Streamlining of __init__-parameters of all tasks
    • Sub-tasks are typically exposed as __init__-parameters with sensible defaults.
    • Defaults for non-trivial parameters like models or tasks are defined in __init__ while the default parameter is None.
    • Instead of exposing parameters that are passed on to sub-tasks, the sub-tasks themselves are exposed.
  • feature: Update supported models
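
The __init__ convention described above (sub-tasks exposed as parameters, None as the declared default, the real default built inside __init__) can be sketched as follows. All class names here are hypothetical stand-ins and do not reproduce the library's tasks or models; the choice to let the sub-task share the parent's model by default is an assumption made for the example.

```python
from typing import Optional


class ExampleModel:
    """Hypothetical stand-in for a model client."""

    def __init__(self, name: str = "example-model") -> None:
        self.name = name


class ExampleHighlight:
    """Hypothetical sub-task that needs a model."""

    def __init__(self, model: Optional[ExampleModel] = None) -> None:
        # The declared default is None; the real default is built here.
        self.model = model if model is not None else ExampleModel()


class ExampleQa:
    """Hypothetical task that exposes its sub-task as a parameter."""

    def __init__(
        self,
        model: Optional[ExampleModel] = None,
        highlight: Optional[ExampleHighlight] = None,
    ) -> None:
        self.model = model if model is not None else ExampleModel()
        # Assumption for this sketch: the default sub-task reuses the
        # same model as the parent task.
        self.highlight = (
            highlight if highlight is not None else ExampleHighlight(self.model)
        )


if __name__ == "__main__":
    default_task = ExampleQa()
    print(default_task.highlight.model is default_task.model)

    custom_highlight = ExampleHighlight(ExampleModel("other-model"))
    custom_task = ExampleQa(highlight=custom_highlight)
    print(custom_task.highlight.model.name)
```

Using None as the declared default avoids constructing models at import time and lets a caller replace a whole sub-task rather than threading individual parameters through the parent.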

Fixes

  • fix: Fixed exception handling in language detection of LanguageMatchesGrader.
  • fix: Fixed a bug that could lead to cut-off highlight ranges in TextHighlight tasks.
  • fix: Fixed list_ids methods to use path_to_str
  • fix: Disallow traces without end in the trace viewer
  • fix: ArgillaClient now correctly uses provided API-URL instead of hard-coded localhost

Full Changelog: v0.6.0...v0.7.0