Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some basic journal documentation #672

Merged
merged 1 commit into from
Nov 8, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
135 changes: 135 additions & 0 deletions docs/run-journal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
# ModelBench journal documentation

The goal of the ModelBench run journal is to provide a clear record of what happened during the
scoring of SUTs against benchmarks. It is a JSONL file consisting of one line per event during
the run.

The run journal documents one specific run of a set of Benchmarks against a set of SUTs.
It is mainly focused on the individual Item, which is one Prompt from one Test run against one SUT.
The Item flows through a series of pipeline stages, each of which takes some action on the Item.

Currently those stages are

- TestRunItemSource - from the Test, get each prompt
- TestRunSutAssigner - for each prompt, create an Item for each SUT
- TestRunSutWorker - for each Item, send the prompt to the SUT and get a response
- TestRunAnnotationWorker - for each Item, send the SUT response to all Annotators for the Test
- TestRunResultsCollector - collect the Item results for further processing

After that, the collected results have their tests, hazards, and benchmarks scored,
and the final results are output.

## message and field definitions

### Common fields

Found in all entries.

- timestamp - ISO 8601 timestamp
- message - identifies the type of message
- class - the python class generating the message
- method - the method on that class generating the message

### Item fields

Found in all item entries.

- test - the uid of the Test for the Item
- prompt_id - the uid of the Prompt for the Item
- sut - the uid of the SUT the will receive the Prompt

### Message meanings:

In order of occurrence:

- starting journal - marks the beginning of the journal
- starting run - gives basic information on the run
- test info - gives information on each Test in the run
- running pipeline - marks the start of the item pipeline
- using test items - for each Test, the counts of Items available and actually used
- queuing item - the beginning of an Item's flow in the pipeline
- fetched sut response - a SUT's live raw response to a single prompt
- using cached sut response - a SUT's cached raw response to a single prompt
- translated sut response - the result of translating a raw response into a common format
- fetched annotator response - an Annotator's live raw response to a single prompt/response pair
- using cached annotator response - an Annotator's cached raw response to a single prompt/response pair
- translated annotation - the result of translating a raw response into a common format
- measured item quality - after all annotations are completed, they combined to give measurements
- finished pipeline - all Items have been run through all stages of the pipeline
- test scored - the combined Item results are used to score the Tests
- hazard scored - the Test scores are used to score the Hazards
- benchmark scored - the Hazard scores are used to score the Benchmarks
- finished run - the run is complete
- cache info - basic information on the use of the caches for SUTs and Annotators

### Message-specific fields

Listed in run order. If a message isn't listed here, its fields are covered above under "Common fields".

- starting run
- run_id - a unique id for the run
- benchmarks - list of uids for Benchmarks in the run
- tests - list of uids for Tests in the run
- suts - list of uids for the SUTs in the run
- max_items - maximum number of Items per test to be run
- thread_count - maximum number of threads per stage of pipeline
- test info
- test - uid of Test
- initialization - initialization record for Test
- sut_options - any options passed to the SUT with a prompt
- dependencies - the external dependencies of the Test, like the source file for the prompts
- using test items
- test - the uid of the Test
- using - how many prompts will be used in this run
- total - how many prompts were available total
- queuing item
- prompt_text - the text of the Prompt
- fetched sut response
- run_time - seconds taken to get the response
- request - raw request sent to the SUT
- response - raw response as received from the SUT
- using cached sut response
- response - the raw response as previously received from the SUT
- translated sut response
- response_text - the text of the SUT response
- fetched annotator response
- annotator - the uid of the Annotator
- run_time - seconds taken to get the response
- response - raw response as received from the Annotator
- using cached annotator response
- annotator - the uid of the Annotator
- response - raw response as previously received from the Annotator
- translated annotation
- annotator - the uid of the Annotator
- annotation - the annotation as translated into a common format; for MLC Benchmarks includes `is_safe`
- measured item quality
- measurements - the Test's measurements of an Item based on annotations
- run_time - total time for collecting all Annotations for a test and calculating measurements
- finished pipeline
- time - total time to run the pipeline
- total_finished - total Items that finished the pipeline
- finished_counts - breakdown of finished items by SUT and Test uid
- test scored
- sut - the uid of the SUT scored
- test - the uid of the Test scored
- items_finished - items finished for the Test and SUT
- result - a Test's view of the results
- hazard scored
- benchmark - the uid of the Benchmark being scored
- sut - the uid of the SUT scored
- hazard - the uid of the Hazard being scored
- score - the Hazard's raw score, the fraction of prompts that were safe
- reference - the grading reference score for the Hazard
- samples - the number of Items used for the scores
- numeric_grade - the numeric version of the grade (1 to 5)
- text_grade - the text version of the grade
- benchmark scored
- benchmark - the uid of the Benchmark scored
- sut - the uid of the SUT scored
- numeric_grade - the numeric version of the grade (1 to 5)
- text_grade - the text version of the grade
- cache info
- type - which kind of cache
- cache - the Python class of the cache plus key parameters
- start_count - how many items were in the cache before the run
- end_count - how many items were in the cache after the run