BigLaw Bench

A new standard for evaluating AI on legal & professional tasks.

Overview

BigLaw Bench is a comprehensive framework for evaluating the performance of large language models (LLMs) on complex, real-world legal tasks. Developed by Harvey's legal research team, BigLaw Bench aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work performed by lawyers, providing a more accurate measure of an AI model's utility in professional legal settings.

Benchmarks

1. BigLaw Bench — Core

BigLaw Bench Core is a set of tasks for benchmarking baseline legal problem-solving. The tasks are organized into two primary categories, transactional and litigation, each encompassing several specific sub-task types:

Transactional Task Categories

  • Corporate Strategy & Advising
  • Drafting
  • Legal Research
  • Due Diligence
  • Risk Assessment & Compliance
  • Negotiation Strategy
  • Deal Management
  • Transaction Structuring
  • Regulatory & Advising

Litigation Task Categories

  • Analysis of Litigation Filings
  • Case Management
  • Drafting
  • Case Law Research
  • Transcript Analysis
  • Document Review and Analysis
  • Trial Preparations & Oral Argument

2. BigLaw Bench — Workflows

BigLaw Bench Workflows is a set of composite legal tasks used to evaluate agentic systems. We currently provide benchmarks for:

SPA Deal Points

Evaluates the ability of LLM agents to extract a variety of standard deal points from a dataset of Share Purchase Agreements (SPAs).
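As a rough illustration of this workflow, the sketch below compares an agent's extracted deal points against gold answers using an exact-match check. The field names, values, and matching logic are assumptions for illustration only; the benchmark's actual deal points and grading rubrics may differ.

```python
# Hypothetical gold answers for a handful of deal points in one SPA.
gold = {
    "purchase_price": "$150,000,000",
    "governing_law": "Delaware",
    "indemnification_cap": "10% of purchase price",
}

# Hypothetical agent output for the same agreement.
extracted = {
    "purchase_price": "$150,000,000",
    "governing_law": "New York",  # incorrect extraction
    "indemnification_cap": "10% of purchase price",
}

def exact_match_accuracy(gold: dict[str, str], extracted: dict[str, str]) -> float:
    """Share of deal points the agent extracted correctly (exact string match)."""
    correct = sum(extracted.get(field) == value for field, value in gold.items())
    return correct / len(gold)

print(f"Deal-point accuracy: {exact_match_accuracy(gold, extracted):.0%}")  # -> 67%
```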

Evaluation Methodology

Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:

  • Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model's response based on specific criteria essential for effective task completion.
  • Source Reliability: Assesses the model's ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.

Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps (e.g., hallucinations). The final answer score represents the percentage of a lawyer-quality work product that the model completes for the user.
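As a minimal sketch of this scoring scheme, the snippet below combines positive and negative rubric points and normalizes by the maximum achievable positive score. The `RubricItem` structure, the example rubric, and the `answer_score` helper are illustrative assumptions, not the actual BigLaw Bench rubric format.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str
    points: float    # positive for required elements, negative for errors (e.g., hallucinations)
    satisfied: bool  # whether a grader judged the response to meet (or trigger) this item

def answer_score(rubric: list[RubricItem]) -> float:
    """Fraction of a lawyer-quality work product completed: earned points over maximum positive points."""
    max_points = sum(item.points for item in rubric if item.points > 0)
    earned = sum(item.points for item in rubric if item.satisfied)
    return max(0.0, earned / max_points) if max_points else 0.0

# Example: two required elements and one hallucination penalty.
rubric = [
    RubricItem("Identifies the governing-law clause", 2.0, True),
    RubricItem("Flags the missing indemnification cap", 3.0, False),
    RubricItem("Cites a non-existent case (hallucination)", -2.0, False),
]
print(f"Answer score: {answer_score(rubric):.0%}")  # -> 40%
```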

Data Samples

Sample tasks and grading rubrics can be found at the links below.

  1. BLB-core: here
  2. BLB-workflows-spa: here

For access to the full dataset and additional resources, please contact Harvey directly.

Credits

Julio Pereyra, Elizabeth Lebens, Matthew Guillod, Laura Toulme, Cameron MacGregor, David Murdter, Karl de la Roche, Emilie McConnachie, Jeremy Pushkin, Rina Kim, Aaron Chan, Jenny Pan, Boling Yang, Nan Wu, Niko Grupen, Lauren Oh, Aatish Nayak, Gabriel Pereyra
