LRCBench

Evals for measuring language models' ability to reason over long contexts.

Currently, we support 3 settings with similar objectives:

  • Coding: The model is given a coding question and a set of helper functions. It must select the three helper functions that solve the problem.
  • Transaction Matching: The model is given a set of accounting records, all but one of which can be paired according to the following criteria: a pair of records has opposite-sign, same-magnitude amounts, the same counterparty, and dates within 4 days of each other. The model must return the unpaired record (see the first sketch after this list).
  • 2-cycle multiplication: The model is given a set of 2-cycles (transpositions, as in undergraduate group theory) and is asked to return the product of the cycles in simplified form (see the second sketch after this list). (Note: Language models currently perform surprisingly poorly on this task.)
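
For concreteness, here is a minimal sketch of the transaction-matching pairing rule described above. The record representation and field names (`amount`, `counterparty`, `date`) are illustrative assumptions, not the benchmark's actual data format.

```python
from datetime import date

def is_pair(a, b):
    """Two records pair if their amounts are opposite in sign and equal in
    magnitude, the counterparty matches, and the dates are within 4 days."""
    return (
        a["amount"] == -b["amount"]
        and a["counterparty"] == b["counterparty"]
        and abs((a["date"] - b["date"]).days) <= 4
    )

def find_unpaired(records):
    """Return the single record that pairs with no other record."""
    return next(
        r for i, r in enumerate(records)
        if not any(j != i and is_pair(r, s) for j, s in enumerate(records))
    )

records = [
    {"amount": 120.0, "counterparty": "Acme", "date": date(2024, 1, 3)},
    {"amount": -120.0, "counterparty": "Acme", "date": date(2024, 1, 5)},
    {"amount": 45.5, "counterparty": "Globex", "date": date(2024, 2, 1)},
]
print(find_unpaired(records))  # the Globex record is the odd one out
```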
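And here is a minimal sketch of what the 2-cycle task asks for: composing a list of transpositions and writing the result as disjoint cycles. The right-to-left composition order and the helper names are assumptions for illustration; the benchmark's exact convention may differ.

```python
def apply_transposition(t, x):
    """Apply the 2-cycle t = (a b) to the point x."""
    a, b = t
    return b if x == a else a if x == b else x

def compose(transpositions, points):
    """Compose the transpositions right-to-left (rightmost applied first)."""
    def sigma(x):
        for t in reversed(transpositions):
            x = apply_transposition(t, x)
        return x
    return {p: sigma(p) for p in points}

def disjoint_cycles(perm):
    """Write a permutation (dict point -> image) as a product of disjoint cycles."""
    seen, cycles = set(), []
    for start in perm:
        if start in seen or perm[start] == start:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        cycles.append(tuple(cycle))
    return cycles

# (1 2)(2 3)(1 3) simplifies to the single cycle (2 3).
ts = [(1, 2), (2, 3), (1, 3)]
print(disjoint_cycles(compose(ts, {1, 2, 3})))  # [(2, 3)]
```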

Benchmarks

Notes

  • To make the correctness results more robust to noise, we stop after 2 consecutive runs that fail to reach 0.6 correctness.
  • We also increased the sample size for the coding benchmark, as it was a tad noisy.
  • I moved from gpt-4/gpt-4-32k to gpt-4-32k at all haystack sizes, as gpt-4 is inferior to gpt-4-32k even at contexts up to 4k. This improved the results for "gpt-4."

Definitions

  • Score: The largest haystack size reached before the two consecutive runs that fail to reach 0.6.
  • Effective Context Window: The average token length (using the GPT-4 tokenizer) of the haystack at the score.
  • Size at First Failure: The largest haystack size such that the model earned a full score at every haystack size up to and including it.
  • Correctnesses: Int(correctness_percentage*100) for each haystack size up to the score.
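
As a rough illustration of how these definitions and the stopping rule from the notes fit together, the sketch below derives the Score, Size at First Failure, and Correctnesses from per-haystack-size correctness fractions. The function and variable names are hypothetical, and the exact bookkeeping at the boundary (which runs count toward the score and the correctness list) is an assumption, not the repo's actual harness.

```python
THRESHOLD = 0.6  # a run "fails" if its correctness falls below this

def summarize(correctness_by_size):
    """correctness_by_size: haystack size -> fraction of problems solved, in [0, 1]."""
    evaluated, consecutive_failures = [], 0
    for size in sorted(correctness_by_size):
        c = correctness_by_size[size]
        evaluated.append((size, c))
        consecutive_failures = consecutive_failures + 1 if c < THRESHOLD else 0
        if consecutive_failures == 2:
            break  # stop after 2 consecutive runs below the threshold

    # Score: largest haystack size before the terminal pair of failing runs.
    before_failures = evaluated[:-2] if consecutive_failures == 2 else evaluated
    score = before_failures[-1][0] if before_failures else 0

    # Size at First Failure: largest size with a full score at it and every smaller size.
    size_at_first_failure = 0
    for size, c in evaluated:
        if c < 1.0:
            break
        size_at_first_failure = size

    # Correctnesses: Int(correctness * 100) per evaluated haystack size.
    correctnesses = [int(c * 100) for _, c in evaluated]
    return {"score": score,
            "size_at_first_failure": size_at_first_failure,
            "correctnesses": correctnesses}

print(summarize({5: 1.0, 10: 0.9, 20: 0.7, 30: 0.5, 40: 0.4}))
```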

Helper Function Invocation - Data Science

Sample size: 30 problems

| LM | Score | Size at First Failure | Effective Context Window | Correctnesses |
| --- | --- | --- | --- | --- |
| claude-3-opus-20240229 | 50 | 0 | 8933 | 7776645 |
| claude-3-5-sonnet-20240620 | 30 | 0 | 6393 | 87655 |
| gpt-4-32k | 30 | 0 | 6393 | 76655 |
| gemini-1.5-pro | 20 | 0 | 4900 | 6655 |
| gemini-1.5-flash | 20 | 0 | 4900 | 7654 |
| gpt-4o-2024-08-06 | 10 | 0 | 1123 | 655 |
| gpt-4o-mini | 10 | 0 | 1123 | 744 |
| gpt-4-turbo | 0 | 0 | 0 | 5 |

Figure: Coding Problem Performance

Transaction Matching

Sample size: 10 problems

| LM | Score | Size at First Failure | Effective Context Window | Correctnesses |
| --- | --- | --- | --- | --- |
| Claude-3-5-sonnet-20240620 | 400 | 80 | 13798 | TTTTTTTTTT8T7863852 |
| Gemini-1.5-pro | 200 | _* | _ | TTTT78899877783844 |
| Claude-3-opus-20240229 | 160 | 60 | 4982 | 9TT89TTT98787723 |
| GPT-4(-32k) | 120 | 20 | 3882 | TTT98987966955 |
| Gemini-1.5-flash | 40 | _ | _ | TTT6623 |
| GPT-4-turbo | 40 | 10 | 1388 | TTT7652 |
| GPT-4o-2024-08-06 | 30 | 10 | 1113 | TTT823 |
| GPT-4o-mini | 20 | 5 | 837 | TT853 |
| gemini-1.5-flash-8b-exp-0827 | 10 | 5 | _ | TT42 |

Figure: Transactions Problem Performance

  • Previously 40 (gpt-4 only), now 100 (gpt-4-32k)
  • I can no longer get Gemini models to run without getting failure responses from them (possibly due to safety filters?)

2-cycle multiplication (Possibly implemented poorly?)

Sample size: 10 problems

| LM | Score | Size at First Failure | Correctnesses |
| --- | --- | --- | --- |
| o1-mini | 30 | 2 | TT9TT897977666 |
| o1-preview | 20 | 0 | 7688877666623 |
| Claude-3-5-sonnet-20240620 | 3 | 1 | T632 |
| Claude-3-opus-20240229 | 3 | 1 | T810 |
| GPT-4-turbo | 3 | 1 | T752 |
| GPT-4o-2024-08-06 | 3 | 1 | T632 |
| GPT-4 | 2 | 1 | T54 |
| GPT-4o-mini | 2 | 1 | T34 |
| Gemini-1.5-pro | 0 | 0 | 0 |
| Gemini-1.5-flash | 0 | 0 | 0 |

  • Previously 2 (gpt-4 only), now 1 (gpt-4-32k)
