BigLaw Bench

A new standard for evaluating AI on legal & professional tasks.

Overview

BigLaw Bench is a comprehensive framework for evaluating the performance of large language models (LLMs) on complex, real-world legal tasks. Developed by Harvey's legal research team, BigLaw Bench aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work performed by lawyers, providing a more accurate measure of an AI model's utility in professional legal settings.

Benchmarks

1. BigLaw Bench — Core

BigLaw Bench Core is a set of tasks for benchmarking baseline legal problem-solving. The tasks are organized into two primary categories, transactional and litigation, each encompassing several specific sub-task types:

Transactional Task Categories

  • Corporate Strategy & Advising
  • Drafting
  • Legal Research
  • Due Diligence
  • Risk Assessment & Compliance
  • Negotiation Strategy
  • Deal Management
  • Transaction Structuring
  • Regulatory & Advising

Litigation Task Categories

  • Analysis of Litigation Filings
  • Case Management
  • Drafting
  • Case Law Research
  • Transcript Analysis
  • Document Review and Analysis
  • Trial Preparations & Oral Argument

2. BigLaw Bench — Workflows

BigLaw Bench Workflows is a set of composite legal tasks used to evaluate agentic systems. We currently provide benchmarks for:

SPA Deal Points

Evaluates the ability of LLM agents to extract a variety of standard deal points from a dataset of Share Purchase Agreements (SPAs).
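As a rough illustration of this workflow, the sketch below compares an agent's extracted deal points against gold answers using an exact-match check. The field names, values, and matching logic are assumptions for illustration only; the benchmark's actual deal points and grading rubrics may differ.

```python
# Hypothetical gold answers for a handful of deal points in one SPA.
gold = {
    "purchase_price": "$150,000,000",
    "governing_law": "Delaware",
    "indemnification_cap": "10% of purchase price",
}

# Hypothetical agent output for the same agreement.
extracted = {
    "purchase_price": "$150,000,000",
    "governing_law": "New York",  # incorrect extraction
    "indemnification_cap": "10% of purchase price",
}

def exact_match_accuracy(gold: dict[str, str], extracted: dict[str, str]) -> float:
    """Share of deal points the agent extracted correctly (exact string match)."""
    correct = sum(extracted.get(field) == value for field, value in gold.items())
    return correct / len(gold)

print(f"Deal-point accuracy: {exact_match_accuracy(gold, extracted):.0%}")  # -> 67%
```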

Evaluation Methodology

Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:

  • Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model's response based on specific criteria essential for effective task completion.
  • Source Reliability: Assesses the model's ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.

Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps (e.g., hallucinations). The final answer score represents the percentage of a lawyer-quality work product that the model completes for the user.
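As a minimal sketch of this scoring scheme, the snippet below combines positive and negative rubric points and normalizes by the maximum achievable positive score. The `RubricItem` structure, the example rubric, and the `answer_score` helper are illustrative assumptions, not the actual BigLaw Bench rubric format.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str
    points: float    # positive for required elements, negative for errors (e.g., hallucinations)
    satisfied: bool  # whether a grader judged the response to meet (or trigger) this item

def answer_score(rubric: list[RubricItem]) -> float:
    """Fraction of a lawyer-quality work product completed: earned points over maximum positive points."""
    max_points = sum(item.points for item in rubric if item.points > 0)
    earned = sum(item.points for item in rubric if item.satisfied)
    return max(0.0, earned / max_points) if max_points else 0.0

# Example: two required elements and one hallucination penalty.
rubric = [
    RubricItem("Identifies the governing-law clause", 2.0, True),
    RubricItem("Flags the missing indemnification cap", 3.0, False),
    RubricItem("Cites a non-existent case (hallucination)", -2.0, False),
]
print(f"Answer score: {answer_score(rubric):.0%}")  # -> 40%
```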

Data Samples

Sample tasks and grading rubrics can be found at the links below.

  1. BLB-core: here
  2. BLB-workflows-spa: here

For access to the full dataset and additional resources, please contact Harvey directly.

Credits

Julio Pereyra, Elizabeth Lebens, Matthew Guillod, Laura Toulme, Cameron MacGregor, David Murdter, Karl de la Roche, Emilie McConnachie, Jeremy Pushkin, Rina Kim, Aaron Chan, Jenny Pan, Boling Yang, Nan Wu, Niko Grupen, Lauren Oh, Aatish Nayak, Gabriel Pereyra
